Preface



This volume contains all the papers presented at the Ninth International
Conference on Algorithmic Learning Theory (ALT’98), held at the European education
centre Europäisches Bildungszentrum (ebz) Otzenhausen, Germany, October
8-10, 1998. The Conference was sponsored by the Japanese Society for Artificial
Intelligence (JSAI) and the University of Kaiserslautern.

Thirty-four papers on all aspects of algorithmic learning theory and related 
areas were submitted, all electronically. Twenty-six papers were accepted by the 
program committee based on originality, quality, and relevance to the theory of 
machine learning. Additionally, three invited talks presented by Akira Maruoka 
of Tohoku University, Arun Sharma of the University of New South Wales, and 
Stefan Wrobel from GMD, respectively, were featured at the conference. We 
would like to express our sincere gratitude to our invited speakers for sharing 
with us their insights on new and exciting developments in their areas of research. 

This conference is the ninth in a series of annual meetings established in 
1990. The ALT series focuses on all areas related to algorithmic learning theory 
including (but not limited to): the theory of machine learning, the design and 
analysis of learning algorithms, computational logic of/for machine discovery, 
inductive inference of recursive functions and recursively enumerable languages, 
learning via queries, learning by artificial and biological neural networks, pattern 
recognition, learning by analogy, statistical learning, Bayesian/MDL estimation, 
inductive logic programming, robotics, application of learning to databases, and 
gene analyses. 

The variety of approaches presented in this and the other ALT proceedings 
reflects the continuously growing spectrum of disciplines relevant to machine 
learning and its applications. The many possible aspects of learning that can be 
formally investigated and the diversity of viewpoints expressed in the technical 
contributions clearly indicate that developing models of learning is still particu- 
larly important to broaden our understanding of what learning really is, under 
which circumstances it can be done, what makes it feasible and complicated, re- 
spectively, and what are appropriate tools for analyzing it. This ALT conference 
as well as its predecessors aimed to extend and to intensify communication in 
the continuously growing scientific community interested in the phenomenon of 
learning. 

Starting this year, the ALT series will further endeavor to bring both the 
theoretical and the experimental communities under one umbrella by organizing 
a satellite workshop on applied learning theory before or after the annual meeting. 
Ideally, people in theory should benefit by learning about challenging research 
problems which arose in practice, and people in application may benefit by 
getting answers to their problems from theoreticians. Putting these two activities
under the unifying ALT logo prompted us to rename the ALT series to the annual
conference on Algorithmic Learning Theory.

The continuing success of these ALT meetings has been managed and
supervised by its steering committee consisting of Setsuo Arikawa (Chair, Kyushu
Univ., Fukuoka), Takashi Yokomori (Waseda Univ., Tokyo), Hiroshi Imai (Univ.
of Tokyo), Teruyasu Nishizawa (Niigata Univ.), Akito Sakurai (JAIST, Tokyo),
Taisuke Sato (Tokyo Inst. Technology), Takeshi Shinohara (Kyushu Inst.
Technology, Iizuka), Masayuki Numao (Tokyo Inst. Technology), and Yuji Takada
(Fujitsu, Fukuoka).

ALT’98 was chaired by Michael M. Richter (University of Kaiserslautern) and 
co-chaired by Carl H. Smith (University of Maryland, College Park). The local 
arrangements chair was Edith Hüttel (University of Kaiserslautern).

We would like to express our immense gratitude to all the members of the 
program committee, which consisted of: 

P. Bartlett (ANU, Australia) 

S. Ben-David (Technion, Israel) 

S. Dzeroski (Jozef Stefan Institute, Slovenia) 

R. Gavalda (Univ. de Catalunya, Spain) 

L. Hellerstein (Polytechnic Univ., USA) 

S. Jain (National Univ. Singapore) 

S. Lange (Univ. Leipzig, Germany) 

M. Li (Univ. Waterloo, Canada) 

H. Motoda (Osaka Univ., Japan) 

Y. Sakakibara (Tokyo Denki Univ., Japan) 

K. Satoh (Hokkaido Univ., Japan) 

T. Shinohara (Kyutech, Japan) 

E. Ukkonen (Helsinki Univ., Finland) 

R. Wiehagen (Univ. Kaiserslautern, Germany, Chair) 

T. Zeugmann (Kyushu Univ., Japan, Co-Chair) 

They and the subreferees they enlisted put a huge amount of work into reviewing 
the submissions and judging their importance and significance. 

We would like to thank everybody who made this meeting possible: the
authors for submitting papers, the invited speakers for accepting our invitation, the
local arrangements chair Edith Hüttel, the ALT steering committee, the
sponsors, IFIP Working Group 1.4 for providing a student scholarship, and
Springer-Verlag. Furthermore, the program committee heartily thanks all referees, who are
listed on a separate page, for their hard work. We also gratefully acknowledge
Shinichi Shimozono’s contribution in helping to produce the ALT’98 logo, and
Masao Mori’s assistance in setting up the ALT’98 Web pages.

Editors’ Introduction



Learning Theory emerged roughly forty years ago, and some of the pioneering
work is still influential, e.g., Gold’s (1967) paper on Language Identification in
the Limit.¹ Despite the huge amount of research invested in this area, the state
of the art in modeling learning is still much less satisfactory than in other areas
of theoretical computer science. For example, computability theory emerged around
60 years ago. Initially, many different models were introduced, e.g., Turing
machines, partial recursive functions, and Markov algorithms. Nevertheless,
later on, all those models were proved to be equivalent.

The situation in algorithmic learning theory is, however, quite different. Nu- 
merous mathematical models of learning have been proposed during the last 
three decades. Nevertheless, different models give vastly different results concer- 
ning the learnability and non-learnability of objects one wants to learn. Hence, 
finding an appropriate definition of learning which covers most aspects of lear- 
ning is also part of the goals aimed at in algorithmic learning theory. Additio- 
nally, it is necessary to develop a unified theory of learning as well as techniques 
to translate the resulting theories into applications. On the other hand, ma- 
chine learning has also found interesting and non-trivial applications but often 
our ability to thoroughly analyze the systems implemented does not keep up 
proportionally. 

Moreover, nowadays the data collected in various fields such as biology,
finance, retail, astronomy, and medicine are growing extremely rapidly, but our
ability to discover useful knowledge from such huge data sets is still too limited.
Clearly, powerful learning systems would be an enormous help in automatically
extracting new interrelations, knowledge, patterns and the like from those and
other huge collections of data. Thus, there are growing challenges to the field of
machine learning and its foundations that require further efforts to develop the
theories needed to provide, for example, performance guarantees, to automate
the development of relevant software and the like.

Each learning model specifies the learner, the learning domain, the source of 
information, the hypothesis space, what background knowledge is available and 
how it can be used, and finally, the criterion of success. For seeing how different 
models can arise by specifying these parameters, we shall outline throughout 
this introduction different possibilities to do so. While the learner is always an 
algorithm, it may also be restricted in one way or another, e.g., by requiring it to 
be space and/or time efficient. The learning domains considered include, for
example, unknown concepts such as table, chair, car, different diseases, and
so on. The aim is then to learn a “rule” that separates positive examples
from negative ones. For example, given a description of symptoms, the “rule” 
learned must correctly classify whether or not a particular disease is present. 

¹ Information and Control 10:447-474.

In his invited lecture, Maruoka (in a paper co-authored with Takimoto) studies
structured weight-based prediction algorithms. The prediction algorithms considered
have a pool of experts, say $E_1, \ldots, E_N$, each of which, at each trial and on
input any member of the underlying domain, outputs either zero or one. That is, the
experts perform a classification task as described above. The algorithm somehow combines
these predictions and outputs its classification. Afterwards, the true
classification is received, and the next trial starts. If the output made by the learner
was wrong, a prediction error has occurred. When making a prediction error, the
learner gets a penalty that is expressed by a suitably defined loss function. The 
learning goal consists in minimizing the loss function. The experts are assigned 
weights which are updated after each trial. While previous work considered the 
experts to be arranged on one layer only, the present paper outlines interesting 
ways to arrange the experts on a tree structure. As a result, the expert model 
can be applied to search for the best pruning in a straightforward fashion by 
using a dynamic programming scheme. 
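
The trial-by-trial protocol described above can be made concrete with a small
sketch. The code below is not the structured, tree-arranged algorithm of the
paper; it is the classical single-layer weighted majority scheme, given only
for illustration. The function name `weighted_majority`, the penalty factor
`beta`, and the use of the 0/1 loss are assumptions of this sketch.

```python
def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Run a single-layer weighted majority scheme (illustrative sketch).

    expert_predictions[t][i] is the 0/1 prediction of expert i at trial t;
    outcomes[t] is the true 0/1 classification revealed after trial t.
    Returns (number of prediction errors of the learner, final weights).
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts
    mistakes = 0
    for preds, truth in zip(expert_predictions, outcomes):
        # Combine the experts: compare the total weight voting for 1 and for 0.
        weight_for_one = sum(w for w, p in zip(weights, preds) if p == 1)
        weight_for_zero = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if weight_for_one >= weight_for_zero else 0
        if guess != truth:
            mistakes += 1  # 0/1 loss: one unit of penalty per prediction error
        # Update: shrink the weight of every expert that was wrong in this trial.
        weights = [w * beta if p != truth else w for w, p in zip(weights, preds)]
    return mistakes, weights


# Three hypothetical experts: one always right, one always saying 1, one always 0.
outcomes = [1, 0, 1, 1, 0]
predictions = [[y, 1, 0] for y in outcomes]
mistakes, weights = weighted_majority(predictions, outcomes)
```

In this toy run the weight of the reliable expert stays at its initial value
while the weights of the two constant experts decay, so the combined vote
follows the reliable expert.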

Learning Logic Programs and Formulae 

Inductive logic programming and learning logic programs constitute a core
area of the ALT meetings, too. The basic scenario can be described as follows.
The learner is generally provided background knowledge $B$ as well as (sequences
of) positive ($E^+$) and negative ($E^-$) examples (all of which can be regarded as
logic programs, too). However, $E^+$ and $E^-$ usually contain only ground clauses
with empty body. The learning goal consists in inferring a hypothesis $H$ (again
a logic program) such that $E^+$ can be derived from $B \wedge H$ while
$B \wedge H \wedge E^-$ is contradiction free. Again, if growing initial segments of
$E^+$ and $E^-$ are provided, we arrive at learning in the limit. Variations of this
model include the use of membership queries to obtain $E^+$ and $E^-$ and of
equivalence queries (or disjointness queries) to terminate the learning process.
Recently, various authors have also looked at different possibilities to formulate
queries, and we shall describe some of them when talking about the relevant papers
in these proceedings. Alternatively, $E^+$ and $E^-$ may be drawn randomly with
respect to some unknown probability distribution, and the learner is required to
produce with high confidence a hypothesis that has small error (both measured with
respect to the underlying probability distribution).
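
The success criterion of this scenario can be illustrated by a minimal sketch.
It assumes, purely for illustration, that all clauses are ground definite
clauses and that negative examples are ground atoms that must not be
derivable; the names `consequences` and `is_acceptable` and the family
relations used as data are hypothetical.

```python
def consequences(clauses):
    """Least set of ground atoms derivable from ground definite clauses.

    Each clause is a pair (head, body) with body a tuple of atoms;
    a fact has an empty body. Computed by naive forward chaining.
    """
    derived = set()
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            if head not in derived and all(atom in derived for atom in body):
                derived.add(head)
                changed = True
    return derived


def is_acceptable(background, hypothesis, positives, negatives):
    """Check that B and H together derive all of E+ and none of E-."""
    derived = consequences(list(background) + list(hypothesis))
    covers_positives = all(atom in derived for atom in positives)
    contradiction_free = not any(atom in derived for atom in negatives)
    return covers_positives and contradiction_free


background = [("parent(ann,bob)", ()), ("parent(bob,cat)", ())]
hypothesis = [("grandparent(ann,cat)", ("parent(ann,bob)", "parent(bob,cat)"))]
positives = ["grandparent(ann,cat)"]
negatives = ["grandparent(bob,ann)"]
ok = is_acceptable(background, hypothesis, positives, negatives)
```

A hypothesis that is too weak fails the first condition, and one that is too
general (for instance, one asserting `grandparent(bob,ann)` as a fact) fails
the second.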

In his invited lecture, Wrobel addresses scalability issues in inductive logic 
programming. This research is motivated by demands in knowledge discovery in 
databases. Very often, the databases available store the information in relational 
database management systems. On the other hand, most currently used know- 
ledge discovery methods rely on propositional data analysis techniques such as 
decision tree learners (C4.5), or propositional rule learning methods (e.g. CN2). 
While these methods have the benefit of being efficient in practice, they cannot 
directly deal with multiple relations that are typical for relational databases. 
Inductive logic programming is much better suited for such purposes, but until
recently it lacked the efficiency to deal with huge amounts of data. Scalability
turned out to be very important in this regard, and Wrobel’s paper presents the 
progress made recently. 

In their invited paper, McCreath and Sharma present Lime, a system for
learning relations. Their inductive logic programming system induces logic programs
from ground facts. This is done via a Bayesian heuristic, where explosion of the
search space is tamed by exploiting the structure of the hypothesis space. Prior
probabilities are assigned by applying Occam’s razor, i.e., simpler hypotheses get
higher probability. Favoring the Bayesian approach leads to another peculiarity.
In contrast to a variety of other approaches, Lime does not grow a clause one
literal at a time. Instead, it builds candidate clauses from groups of literals. The
clauses obtained can be efficiently combined for obtaining new clauses such that 
the coverage of the resulting clause is just the intersection of the coverage of 
the clauses it is built from. The overall result is a learning system for the fixed 
example size framework that is, as experiments show, particularly good when it 
has to learn recursive definitions in the presence of noise. 

Krishna Rao and Sattar present a polynomial time learning algorithm for a
rich class of logic programs, thereby considerably extending (and partially
correcting) results obtained by Arimura.² The information sources are equivalence,
subsumption, and request-for-hint queries. Input to a subsumption query is a
clause $C$, and it is answered “yes” iff $C$ is a tautology or $H^* \succeq C$, where
$H^*$ denotes the target concept. Otherwise, the answer is just “no.” A
request-for-hint query takes as input a ground clause $C$, and answers “yes” provided
$C$ is subsumed by $H^*$. Otherwise, the reply is “no” together with a hint, i.e., an
atom along with a suitable substitution that can be refuted from the target and the
body of the ground clause. As a matter of fact, all these queries can be answered in
time polynomial in the length of the target and $C$. The main new feature included
in their article is the target class of finely-moded logic programs, which allow
local variables. Moreover, background knowledge previously learned is
incrementally used during the learning process.

Besides ILP and the techniques developed within this framework, there is 
another major line of research that conceptually fits into the setting of learning 
logic programs, i.e., learning subclasses of concepts expressible by elementary 
formal systems (abbr. EFS). This year, Sugimoto continues this line of
research by extending the EFS’s to linearly-moded EFS’s. Again, the main new
feature is the inclusion of local variables, which turned out to be important for
defining translations over context-sensitive languages. The main goal is then the design
of an efficient learner for such translations from input-output sentences, i.e., 
positive examples only. This goal is partially achieved by providing an algorithm 
that learns the whole class of translations definable by linearly-moded EFS’s 
such that the number of clauses in the defining EFS’s and the length of each 
clause are bounded by some a priori fixed constant. A natural generalization of 
this result would be bounding the length of the output sentences by a constant 
multiple of the input length. However, the resulting class of translations is not 
learnable from data at all. 



² H. Arimura, Learning acyclic first-order Horn sentences from entailment, in
Proc. ALT’97, Lecture Notes in Artificial Intelligence 1316, pp. 432-445.




Yamamoto provides a rigorous analysis of several learning techniques
developed in the area of machine learning, such as saturant generalization, bottom
generalization, Y*-operation with generalization and inverse entailment. The main
goal is to present a unifying framework from which all these methods can be
obtained as instances. Within the framework obtained, the methods are compared
to one another, e.g., with respect to their inference power.

Satoh deals with the case-based representability of Boolean functions.
Intuitively, a case base contains knowledge already obtained. If a new task has to
be handled, i.e., one that is not in the case base, one looks for a case that is
similar, where the similarity is measured by an appropriate measure. Finding
good similarity measures has attracted considerable attention. Looking at a
similarity measure that is based on set inclusion of different attributes in a case,
Satoh establishes nice connections to the monotone theory initiated by Bshouty
in 1993.

Finally, Verbeurgt addresses the problem of learning subclasses of monotone
DNF. Learning DNF efficiently is certainly of major interest but apparently very
hard. The learning model considered is a variant of Valiant’s PAC-learning
model. In this model, examples are drawn randomly, and with high confidence,
one has to find a hypothesis that has only a small error. Here, the error is mea- 
sured by summing up all probabilities for elements in the symmetric difference 
of the target and the hypothesis output. Usually, the learner has to succeed for 
every underlying probability distribution. The version considered by Verbeurgt 
restricts the class of probability distributions to the uniform distribution. He 
extends previous results on read-once DNF within this model by refining the 
Fourier analysis methodology. 
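
The error measure used in this model can be illustrated for the uniform
distribution over Boolean assignments. The following sketch computes, by
brute-force enumeration, the probability mass of the symmetric difference
between a target and a hypothesis, both given as monotone DNF formulas. The
representation (a list of sets of variable indices) and the function names are
assumptions of this sketch, and the enumeration is only feasible for small n.

```python
from itertools import product


def eval_monotone_dnf(terms, assignment):
    """Evaluate a monotone DNF; each term is a set of variable indices."""
    return any(all(assignment[i] for i in term) for term in terms)


def uniform_error(target, hypothesis, n):
    """Probability, under the uniform distribution on {0,1}^n, that the
    target and the hypothesis disagree (mass of the symmetric difference)."""
    disagreements = sum(
        eval_monotone_dnf(target, x) != eval_monotone_dnf(hypothesis, x)
        for x in product((0, 1), repeat=n)
    )
    return disagreements / 2 ** n


target = [{0, 1}, {2}]   # x0 x1 OR x2
hypothesis = [{2}]       # x2 alone
error = uniform_error(target, hypothesis, 3)
```

Here the two formulas disagree only on the assignment (1, 1, 0), so the error
under the uniform distribution is 1/8.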

Learning Formal Languages

In this setting, the learning domain is the set of all strings over some fixed 
finite alphabet, and the different objects to be learned are subsets of strings. 
The source of information may be augmenting initial segments of sequences ex- 
hausting all strings over the underlying alphabet that are classified with respect 
to their containment in the target language (learning from informant), or just 
growing initial segments of sequences containing eventually all the strings that 
belong to the target language (text, or positive data). The learner has then to 
map finite sequences of strings into hypotheses about the target language. The 
investigation of scenarios in which the sequence of computed hypotheses stabili- 
zes to a correct finite description (e.g., a grammar, an acceptor, a pattern) of the 
target has attracted much attention and is referred to as learning in the limit. 
Instead of requesting the learner to converge syntactically one can also consi- 
der semantical convergence, i.e., beyond some point exclusively hypotheses are 
output that generate the same language. The latter scenario is usually referred 
to as behaviorally correct learning. As for the present volume, there are several 
papers dealing with these models. 
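
As a minimal illustration of learning in the limit from text, consider the
class of all finite languages. A learner that always conjectures exactly the
set of strings seen so far changes its hypothesis only finitely often on any
text for a finite language and then stays with a correct description. The
generator name `finite_language_learner` and the use of a Python set as the
"description" of a language are assumptions of this sketch.

```python
def finite_language_learner(text):
    """Yield one hypothesis per datum of the text (a sequence of strings).

    The hypothesis is simply the finite set of strings observed so far.
    """
    seen = set()
    for string in text:
        seen.add(string)
        yield frozenset(seen)


# An initial segment of a text (positive data) for the language {ab, b, aab}.
text = ["ab", "b", "ab", "aab", "b", "b"]
hypotheses = list(finite_language_learner(text))
```

After the fourth datum every string of the target has appeared, so all later
hypotheses coincide; this is the stabilization required by the model.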

Head et al. study the learnability of subclasses of regular languages. While
the whole class of regular languages has been known to be non-inferable from
positive data, certain subclasses are inferable. In particular, they pinpoint the common,
essential properties of the previously unrelated frameworks of reversible langua- 
ges and locally testable languages by defining and studying equivalence relations 
over the states of finite automata. If appropriately defined, both language clas- 
ses emerge. The learnability result is based on Angluin’s characteristic sample 
technique and all definable classes are shown to be learnable. Then, finally, the 
authors show that even the whole class of regular languages is approximately 
identifiable by using any of the defined classes as hypothesis space. That is, in- 
stead of synthesizing an acceptor correctly describing the target language, the 
best fit from the hypothesis space is learned. 

Case and Jain consider indexed families of uniformly recursive languages, i.e., 
language classes that are recursively enumerable in a way such that the mem- 
bership problem is uniformly decidable for all languages enumerated. Now, given 
as input any index (or a program) generating any such class, they address the 
problem of whether or not a learner, if there is any, can be algorithmically syn- 
thesized that learns the whole class even from noisy data. Noisy data are defined 
along the model introduced by Stephan in his ALT’95 paper. Roughly speaking, 
in this model correct data occur infinitely often while incorrect data are presen- 
ted only finitely many times. The new restriction made is that the noisy data 
are computable. Then, the main positive result obtained is very strong:
grammars for each indexed family can be learned from computable noisy positive
data within the framework of semantic convergence, and these learners can all
be synthesized. Thus, there is a huge gap between what can be learned in a completely
computable universe and what can be learned from arbitrary positive data. In
particular, these results show that additional background knowledge can considerably
enhance the learning capabilities.

In a sense, Stephan and Ventsov study the same problem though in a diffe- 
rent context, i.e., whether or not background knowledge may help in learning 
(here called semantical knowledge). Now, the language classes are defined via 
algebraic structures (e.g., monoids, ideals of a given ring, vector spaces) and 
the background knowledge is provided in the form of programs for the
underlying algebraic operations. What is shown is that such background knowledge
can improve both the overall learning power and the efficiency of learners
(measured in the number of mind changes to be performed). Finally, a purely
algebraic notion is characterized in terms of pure learning theory: a recursive
commutative ring is Noetherian iff the class of its ideals is behaviorally correct
learnable from positive data.

But there are more ways to attack the problem of how additional knowledge 
may help. In her ALT’95 paper, Meyer has observed that in the setting of learning 
indexed families from positive data, probabilistic learning under monotonicity 
constraints is more powerful than deterministic learning. A probabilistic learner 
is allowed to flip a coin each time it reads a new example, and to branch its 
computation in dependence on the outcome of the coin flip. The monotonicity 
constraints formalize different versions of how to realize the subset principle to 
avoid overgeneralization, and these formalizations go in part back to Jantke’s
paper at the very first ALT meeting in 1990. This year, Meyer asks what knowledge
is necessary to compensate for the additional power of probabilistic learners. Now,
knowledge is provided in the form of oracles, and instead of flipping a coin, the
deterministic learner may ask the oracle $A$ a membership query, i.e., “$x \in A$?,”
where $x$ depends on the examples received so far. To give a flavor of the results
obtained, we just mention two. First, if probabilistic learners under monotonicity
constraints are requested to learn with probability $p$ ($p > 2/3$ or $p > 1/2$,
depending on the particular type of the monotonicity constraint), then the oracle
for the halting problem suffices to compensate for the power of probabilistic
learning. However, these bounds are tight, i.e., if $p = 2/3$ ($p = 1/2$), then the
oracle for the halting problem is too weak.

Certain classes of languages are not inferable from positive data, among them
the regular languages and any superset thereof. This result goes back to Gold’s
(1967) paper, where he showed that every language class containing at least one
infinite language and all finite languages is not learnable from positive data.
Thus, one may think that there are no interesting language classes at all that
can be inferred from positive data. However, this is a misleading impression.
The most prominent counterexample is the class of pattern languages introduced by
Angluin in 1980. Patterns are a very natural and convenient way to define
languages. Take some constant symbols and symbols for variables. Then every finite
non-null string over these symbols constitutes a pattern, e.g., $a x_1 b b x_1$. The
language generated by a pattern $\pi$ is the set of all strings that can be obtained by
substituting constant strings for the variables occurring in $\pi$. A pattern is said
to be regular if every variable symbol occurs at most once in it.
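
The definition of the language generated by a pattern can be tried out with a
small sketch that translates a pattern into a regular expression with
backreferences. It assumes that variables are replaced by non-empty strings,
that variable names such as `x1` are valid group names, and that a pattern is
given as a list of symbols; the function names are hypothetical. Backtracking
may take exponential time in general, so this is meant for small examples only.

```python
import re


def pattern_to_regex(pattern, variables):
    """Translate a pattern (list of symbols) into a compiled regex."""
    parts, seen = [], set()
    for symbol in pattern:
        if symbol in variables:
            if symbol in seen:
                # Later occurrence: must repeat the same substituted string.
                parts.append(f"(?P={symbol})")
            else:
                # First occurrence: any non-empty string.
                parts.append(f"(?P<{symbol}>.+)")
                seen.add(symbol)
        else:
            parts.append(re.escape(symbol))
    return re.compile("".join(parts))


def generates(pattern, variables, string):
    """True iff the string belongs to the language of the pattern."""
    return pattern_to_regex(pattern, variables).fullmatch(string) is not None


def is_regular(pattern, variables):
    """True iff every variable occurs at most once in the pattern."""
    return all(pattern.count(v) <= 1 for v in variables)


pattern = ["a", "x1", "b", "b", "x1"]
in_language = generates(pattern, {"x1"}, "acbbc")       # x1 replaced by "c"
not_in_language = generates(pattern, {"x1"}, "aabbb")   # no consistent choice
```

The example pattern is not regular, since `x1` occurs twice; a pattern such as
`["a", "x1", "b", "x2"]` is.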

Sato et al. study the learnability of the language class $RP^k$ that can be
obtained by taking the union of at most $k$ regular pattern languages, where $k$
is a priori fixed. This class has been shown to be learnable from positive data by
Wright in 1989. Thus, there are characteristic samples for all these languages,
i.e., for every language $L$ there is a finite set $S \subseteq L$ such that
$S \subseteq L'$ implies $L \subseteq L'$ for every $L' \in RP^k$. These
characteristic samples play an important role
in the design of learning algorithms. The present paper shows that there is a
simple way for getting such characteristic samples by taking all substitutions of
size 1 (or 2) provided there are at least $2k + 1$ ($2k - 1$) many constants.

Sakamoto studies versions of the consistency problem for one-variable
pattern languages that may be interesting when learning from noisy data is required.
The problem is, given a set of positive and a set of negative examples, does
there exist a one-variable pattern generating all positive examples and none of
the negative ones? The new idea introduced is that the given strings may contain
wild cards. In particular, he shows the consistency problem to be NP-complete
provided the pattern must separate the sets of positive and negative examples for
all possible replacements of the wild cards by constant symbols.

In all the papers mentioned above, the learner has been required to behave 
appropriately when getting data for the languages to be learned. This also im- 
plied that some prior knowledge is provided in the form of a suitable hypothesis 
space. However, looking at applications such as data mining, it is quite conceivable
that we do not have such prior knowledge. Thus, one may just guess a
hypothesis space. But then the question arises of how the learner behaves when getting
data for a target that has no representation within the guessed hypothesis space. 
This question is meaningful as long as the hypothesis space guessed is not an ac- 
ceptable programming system but rather, let’s say, an indexed family. The idea 
that the learner must be able to refute the hypothesis space has been introduced 
by Mukouchi and Arikawa in their 1993 ALT paper. Subsequently, Lange and
Watson (ALT’94) modified this approach by requesting that the learner must be
able to refute initial segments of data sequences that do not correspond to any
language in the hypothesis space. This year, Jain continues this line of
research. Instead of looking at indexed families as in previous work, he considers
general classes C of recursively enumerable languages. Now, allowing the class
of all computer programs as hypothesis space, one can still insist on refuting all
initial segments of texts (informants) that do not correspond to any language 
in C. Alternatively, one may also allow the learner to either refute or identify 
them. Finally, one may require the learner to refute only initial segments of texts 
(informants) that it cannot learn. Surprisingly, the latter approach is the most 
powerful one, and, for learning from text, it also achieves the whole learning 
power of learning in the limit from positive data. 

There is one more paper dealing with language learning, but it differs from
all the papers mentioned above in that it uses queries to gain information about
the target objects to be learned. Therefore, we discuss it within the next section.

Learning via Queries 

We can imagine the learning via queries scenario as the interaction between 
a teacher and a learner that can communicate using some prespecified query 
language. For example, when learning the concept of a chair, the learner may ask
“is a sofa an example of a chair?”, or when learning a language, a query may
be of the form “is w a string from the target language?” This type of question is
referred to as a membership query. Alternatively, one can allow the learner to ask
“is G a grammar for the target language?” The latter type of question is called
an equivalence query, and it is easy to see how to generalize it to any learning
domain. Clearly, a positive answer to an equivalence query directly yields the
information that the target has been learned correctly, and the learner can (and
has to) stop.
If the answer is negative, usually a counterexample is returned, too, i.e., an
element from the symmetric difference of the target and the object described
by the equivalence query. In general, whatever the query language is, now the
learner is required to halt and to output a correct description for the target.
Moreover, it is also easy to see that every indexed family can be learned from
equivalence queries alone. However, this may require a huge number of queries,
and thus may be beyond feasibility. Therefore, within the learning via queries
scenario, the query complexity is usually of major concern. What one usually 
wants is that the overall number of queries asked is polynomially bounded in the 
length of the target and the longest counterexample returned. 
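
The observation that an indexed family can be learned from equivalence queries
alone, at a possibly prohibitive query cost, can be sketched as follows. The
learner simply walks through the enumeration and asks one equivalence query per
index. For the sake of a runnable example the languages are finite Python sets
and the teacher is simulated; `make_teacher` and `learn_by_enumeration` are
hypothetical names.

```python
def make_teacher(target):
    """Simulated teacher answering equivalence queries for a fixed target."""
    def equivalence_query(hypothesis):
        difference = target ^ hypothesis  # symmetric difference
        if not difference:
            return True, None
        return False, min(difference)     # some counterexample
    return equivalence_query


def learn_by_enumeration(indexed_family, equivalence_query):
    """Ask one equivalence query per index until the answer is positive.

    Returns the index found and the number of queries asked.
    """
    for index, language in enumerate(indexed_family):
        is_correct, _counterexample = equivalence_query(language)
        if is_correct:
            return index, index + 1
    raise ValueError("target is not in the indexed family")


# Hypothetical indexed family: L_i = {0, 1, ..., i}.
family = [frozenset(range(i + 1)) for i in range(50)]
index, queries = learn_by_enumeration(family, make_teacher(family[7]))
```

The number of queries grows with the index of the target and ignores the
counterexamples entirely, which is why query complexity is the central
concern in this setting.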

Melideo and Varricchio consider the learnability of unary output two-tape
automata from equivalence and multiplicity queries. In order to understand what
a multiplicity query is, we first have to explain what an automaton with
multiplicity is. Take any field K and a non-deterministic automaton in which
weights (i.e., elements from K) are assigned to each initial state, each final state,
and each edge of the automaton. Such an automaton can be considered as
computing a function that maps strings into elements of K, and a multiplicity
query returns the value of this function for a given string. Now, the authors show
that the behavior of a unary output two-tape automaton can be identified with
the behavior of a suitably defined automaton with multiplicity. Thus, the
original learning problem is reduced to that of learning the resulting automata with
multiplicity. The learner provided has a query complexity that is polynomially
bounded in the size of the automaton with multiplicity.
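
How an automaton with multiplicity computes its function can be sketched with
plain matrix arithmetic: the value for a string is the initial weight vector,
multiplied by one transition matrix per letter, multiplied by the final weight
vector. The example automaton (over the rationals, restricted here to integer
weights) and the function name `multiplicity_query` are assumptions of this
sketch.

```python
def multiplicity_query(initial, transitions, final, word):
    """Value computed by an automaton with multiplicity on a given word.

    initial, final: weight vectors indexed by state;
    transitions[letter][i][j]: weight of the edge i -> j labelled by letter.
    """
    row = list(initial)
    for letter in word:
        matrix = transitions[letter]
        # Row vector times matrix: sum over all paths extended by this letter.
        row = [
            sum(row[i] * matrix[i][j] for i in range(len(row)))
            for j in range(len(row))
        ]
    return sum(r * f for r, f in zip(row, final))


# Hypothetical two-state automaton whose value is the number of a's in the word.
initial = [1, 0]
final = [0, 1]
transitions = {
    "a": [[1, 1], [0, 1]],
    "b": [[1, 0], [0, 1]],
}
value = multiplicity_query(initial, transitions, final, "abaab")
```

In the example each occurrence of the letter `a` contributes exactly one
accepting path of weight one, so the computed value counts the occurrences.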

Fischlin asks whether or not learning from membership queries can be sped
up by parallelizing it. Defining the depth of a query q to be the number of other
queries on which q depends and the query depth of a learning algorithm
to be the maximum query depth taken over all queries made, the problem of
whether or not a query learner can be parallelized is then equivalent to asking
whether or not the query depth can be reduced. Assuming the existence of
cryptographic one-way functions, Fischlin proves the following strong result:
for any fixed polynomial d, there is a concept class $C_n$ that is efficiently
learnable from membership queries alone in query depth $d(n) + 1$, but $C_n$ cannot
be weakly predicted from membership and equivalence queries in depth $d(n)$.

Damaschke provides a positive result concerning the parallelizability of
learning Boolean functions that depend on only a few variables. Moreover, he
looks not only at the overall number of queries but also at the resulting overall
complexity of learning.

Another lower bound for the overall number of membership queries is given 
by Shevchenko and Zolotykh. They consider the problem of learning half-spaces 
over finite subsets of the n-dimensional Euclidean vector space. This is done 
by carefully elaborating the structure of so-called teaching sets for half-spaces, 
i.e., sets such that only one half-space agrees with all points in them. The lower 
bound obtained is close to the best known upper bound. 

Ben-David and Lindenbaum (cf. EuroCOLT’95, LNAI 1208) proposed several 
models to adapt the idea of learning from positive data only (that has attracted 
so much attention in language learning) to the PAC model. Denis is continuing 
along this line of research. He defines an appropriate PAC model and shows that 
extra information concerning the underlying distribution must be provided. This 
information is obtained via statistical queries. A couple of concept classes are 
shown to be learnable within the model defined, e.g., k-DNF and k-decision lists. 

Learning Recursive Functions 

The area of learning recursive functions is traditionally well represented in the 
ALT series. The information given to the learner is usually augmenting sequences 
f(0), f(1), f(2), ... of the function f to be learned. Admissible hypothesis spaces 
are recursive enumerations of partial recursive functions that comprise the tar- 
get class. Again, the learner has to output a sequence of hypotheses about the 
target function, i.e., indices or encodings of computer programs. The learning 
goal consists in identifying in the limit an index in the relevant enumeration such 
that the enumerated function correctly computes the target with respect to the 
correctness criterion introduced. Starting with Gold’s and Putnam’s pioneering 
work, this learning model has been widely studied. Many variations of this basic 
setting have been considered, and the present volume provides further specifica- 
tions. 

Suppose you have a learner for a class U_1 and another for a class U_2. Now, it 
would be nice to have a more powerful learner that can identify simultaneously 
U_1 ∪ U_2. However, learning in the limit is not closed under union, and this fact 
led Apsītis et al. to investigate the following refined version of closedness under 
union. Assume you have classes U_1, ..., U_n each of which is learnable in the limit. 
What can be said about the learnability of the union of all these classes provided 
that every union of at most n - 1 classes is learnable in the limit? Clearly, the 
answer may depend on n, since for n = 2 the answer is no as mentioned above. 
Therefore, more precisely, one has to ask whether or not there exists an n such 
that the union of all such classes is always learnable. The minimal such n is 
referred to as the closedness degree, and the authors determine the degree for a 
large number of learning types. 

A natural variation of the basic scenario described above is prediction. Now, 
instead of outputting hypotheses about the target function, the learner can keep 
its guesses to itself. Learning success is measured instead by its ability to predict 
the function values for inputs it has not seen before. That is, beyond some point 
the learner must be able to predict correctly. Again, what is meant by correctly 
depends on the correctness criterion considered. Examples of correctness criteria 
include always correct, correct for all but a finite number of inputs (prespecified 
or not), correct for a certain fraction of the inputs, and so on. Case et al. consider 
this model for the case that the targets may drift over time. While similar questions 
have been addressed within the PAC model, this is the first paper that studies 
concept drift in the more general, recursion theoretic setting. Different versions are proposed 
and related to one another. Moreover, the authors also analyze the learnability 
of some natural concept classes within their models. This is a nice combination 
of abstract and concrete examples. 

Whenever learning in the limit is concerned, one usually cannot decide whether 
or not the learner has already converged to its final hypothesis for the actual 
target. Thus, it seems only natural to demand the learner to correctly reflect 
all the data already seen. Learners behaving thus are said to be consistent. At 
first glance, it may also seem useless to allow a learner to output inconsistent 
hypotheses. Nevertheless, consistency is a severe restriction, as has been shown 
elsewhere. In his paper, Stein investigates how the demand to learn consistently 
may affect the complexity of learning. That is, assuming 
a function class can be learned consistently, he shows that for every recursive 
bound, there is a class that can be learned inconsistently with polynomial 
update time, but every consistent learner needs an update time which is above the 
recursive bound given. 

Case, Ott et al. generalize the scenario described above by looking at learning 
an infinite branch of a computable tree. Thus the concept class considered is the 
set of all computable trees that contain at least one infinite computable branch. 
This work derives its motivation from process-control games. 

Hirowatari and Arikawa extend the classical model to learning recursive 
real-valued functions. These functions are regarded as computable interval mappings. 
Both coincidences with and surprising differences from the learnability of natural-valued 
recursive functions are shown. In particular, these differences are established with 
respect to recursively enumerable classes and consistent identification. This work 
considerably extends their results presented at ALT’97. 

Barzdins and Sarkans continue their work (with varying coauthors) presented 
at previous ALT meetings to design practically feasible inference algorithms. 
The new feature consists in using attribute grammars to express several kinds 
of additional knowledge about the objects to be learned. 

We finish this section with the paper of Grieser et al. that looks at the learna- 
bility of recursive functions from quite a different perspective. The main problem 
studied is the validation of inductive learning systems. The authors propose a 
model for the validation task, and relate the amount of expertise necessary to 
validate a learning system to the amount of expertise needed for solving the lear- 
ning problems considered. Within the model introduced, the ability to validate 
a learning system implies the ability to solve it. 

Miscellaneous 

Arimura et al. study the data mining problem of discovering two-word association 
rules in large collections of unstructured texts. They present very efficient 
algorithms solving this task. Nice applications are outlined, and the algorithms 
have been implemented and tested using a huge database of amino acid sequences. 

Schmitt is investigating the sample complexity for neural trees. A neural tree 
is a feedforward neural network with at most one edge outgoing from each node. 
The paper relates the sample complexity to the VC dimension of classes of neural 
trees. The main result is a lower bound of n log n for the sample complexity of 
neural trees with n inputs. 
Abstract. Inductive Logic Programming is concerned with a difficult 
problem: learning in first-order representations. If stated in an unre- 
stricted fashion, ILP’s classical learning task, the inductive acquisition 
of first-order predictive theories from examples, is undecidable; even the 
more restricted practical tasks are known to be not polynomially PAC- 
learnable. The idea of using ILP techniques for Knowledge Discovery in 
Databases (KDD), or Data Mining, where very large datasets need to be 
analyzed, thus seems impossible at first sight. However, a number of re- 
cent advances have allowed ILP to make significant progress on the road 
to scalability. In this paper, we will give an illustrative overview of the 
basic aspects of scalability in ILP, and then describe recent advances 
in theory, algorithms and system implementations. We will give exam- 
ples from implemented algorithms and briefly introduce Midos, a recent 
first-order subgroup discovery algorithm and its scalability ingredients. 



1 Introduction 

Data Mining, or Knowledge Discovery in Databases (KDD), has recently been 
gaining widespread attention. In one popular definition, KDD is seen as the 
“non-trivial process of identifying valid, novel, potentially useful, and ultimately 
understandable patterns in data” [FPSS96], whereas data mining is usually seen 
as one step in the iterative KDD process, namely the application of 
(semi-automatic) analysis methods to find results. As the definition indicates, KDD 
focuses on the entire process of going from real-world data to useful results, 
i.e., including not only data mining but also tasks such as data preprocessing 
(selection, cleaning, transformation) and result postprocessing (evaluation, 
interpretation, use). KDD places a high emphasis on the understandability, novelty 
and usefulness of results, i.e., does not evaluate results solely on the basis of 
validity or accuracy. Even though not explicitly mentioned in the definition, in 
a KDD context it is usually assumed that real-world datasets are very large and 
often stored in (relational) database management systems, resulting in very high 
demands on the scalability of analysis methods. 

When looking at the most popular data mining methods, we see that KDD 
currently almost exclusively relies on propositional data analysis techniques, i.e., 
algorithms that accept a single table of simple-valued data as input. Examples 
of such techniques for classification or prediction are decision tree learners such 
as C4.5 [Qui93], propositional rule learning methods such as CN2 [CN89], linear 
and nonlinear regression methods such as artificial neural networks [Hay98] or 
Bayesian methods [CL96]. None of these propositional techniques can directly 
deal with the multiple relations that are typical of relational databases, and 
some of them do not produce very explicit and understandable output. On the 
other hand, first-order techniques as developed in the field of Inductive Logic 
Programming (ILP, see e.g. [MDR94, Wro96, NCW97]) up to now have been 
perceived as too slow and plagued by scalability problems. Therefore, they are not 
seen as ideal candidates for KDD even though they are capable of dealing with 
the multiple relations of a relational database directly, produce understandable 
output in explicit logical form and can even use rich background knowledge. 

Fortunately, recent work in the field of ILP has led to significant progress, 
in particular on the issue of scalability. In this paper, we present an illustrative 
overview of some of these advances in theory, algorithms and system 
implementations, focusing in particular on new task definitions and the use of sampling. In 
the next section, we will define the task of ILP; in section 3, we discuss the basic 
complexity issues entailed by this task. We then describe the classical approaches 
to scalability in section 4, followed by more recent approaches in section 5. As an 
example of scalability techniques in context, we conclude with a brief discussion 
of Midos, a recent ILP subgroup discovery algorithm (section 6). 



2 The basics of Inductive Logic Programming 

Let us first illustrate the most important ILP learning task by a simplified 
example from the domain of telecommunications described in [SMAU94]. In this 
domain, the goal was to replace an existing, manually created access control 
database with a set of verifiable access control rules in a declarative language. 
The database stated, for each employee, which switching systems this employee 
was allowed to access, but did not include the reasons why this was the case. 
The access rules, on the other hand, were to use the available background 
knowledge about the network and the employees, their affiliations and qualifications 
to decide about access rights in a general and explicit fashion. 

In this application, the manually created access rights database can be 
represented by a set of first-order facts that are used as positive examples of the 
target concept may_operate: 

may_operate(bode,pabx_17). 
may_operate(meyer,pabx_15). 

Here, and for the rest of the paper, we follow the convention from logic 
programming and Prolog (see e.g. [Llo87]) and use lowercase names for predicates, 
functions, and constants, whereas variable symbols always begin with an 
uppercase letter. 

Further facts about unauthorized access are used as negative examples of the 
target concept^1: 

not(may_operate(bode,pabx_15)). 
not(may_operate(meyer,pabx_17)). 
not(may_operate(miller,pabx_15)). 

To be able to learn from these examples, it is also necessary to represent the 
available background knowledge about the domain. Usually, this is done in the 
form of first-order facts and rules (clauses): 

operator(bode).              works_for(bode,comtel). 
operates(telplus,pabx_17).   subsidiary(comtel,telplus). 
engineer(meyer).             works_for(meyer,nettalk). 
operates(talkline,pabx_17).  subsidiary(nettalk,talkline). 
accountant(miller).          works_for(miller,nettalk). 

operator(X) → technical(X). 
engineer(X) → technical(X). 
From the above information, an ILP learning system can learn the rule: 

works_for(P,C1) & operates(C2,S) & subsidiary(C1,C2) 
& technical(P) → may_operate(P,S). 

stating that all technical personnel in a subsidiary company may operate all 
systems managed by the parent company. 
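
To make concrete what it means for this rule to cover the positive examples and none of the negative ones, the following Python sketch evaluates the rule body against the background facts. The tuple encoding and the function name may_operate are illustrative choices, not taken from any ILP system. One assumption about the data is made and marked in the code: talkline is taken to operate pabx_15, because with the value pabx_17 given in the listing the rule would cover the negative example for meyer and miss the positive one.

```python
# Background facts as tuples (predicate, arg1, ...).
# ASSUMPTION: operates(talkline, pabx_15) is used here instead of pabx_17,
# so that the learned rule agrees with the listed examples.
facts = {
    ("operator", "bode"), ("works_for", "bode", "comtel"),
    ("operates", "telplus", "pabx_17"), ("subsidiary", "comtel", "telplus"),
    ("engineer", "meyer"), ("works_for", "meyer", "nettalk"),
    ("operates", "talkline", "pabx_15"), ("subsidiary", "nettalk", "talkline"),
    ("accountant", "miller"), ("works_for", "miller", "nettalk"),
}

# Background rules: operator(X) -> technical(X), engineer(X) -> technical(X).
facts |= {("technical", f[1]) for f in facts if f[0] in ("operator", "engineer")}


def may_operate(p, s):
    """Evaluate the body of the learned rule for person p and system s."""
    if ("technical", p) not in facts:
        return False
    for f in facts:
        if f[0] == "subsidiary":
            _, c1, c2 = f
            if ("works_for", p, c1) in facts and ("operates", c2, s) in facts:
                return True
    return False


positives = [("bode", "pabx_17"), ("meyer", "pabx_15")]
negatives = [("bode", "pabx_15"), ("meyer", "pabx_17"), ("miller", "pabx_15")]

complete = all(may_operate(p, s) for p, s in positives)    # covers every positive
correct = not any(may_operate(p, s) for p, s in negatives)  # covers no negative
print(complete, correct)
```

Real ILP systems prove such queries with a Prolog-style engine rather than by scanning a fact set; the scan is used here only to keep the sketch short.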

As another example, given examples of reverse, and append as background 
knowledge, an ILP program can find the following clause (lists written in Prolog 
notation): 

reverse(B,D) & append(D,[A],C) → reverse([A | B],C) 

More formally, we can define the basic task of ILP as follows. Assume that 
we are given arbitrary, but fixed languages L_B, L_E and L_H for background 
knowledge, examples and hypotheses. In ILP, these languages are subsets of 
first-order logic. In addition, we assume that we have an entailment relationship |= on 
sets of statements from these languages. In ILP, this is usually logical entailment 
or a subset thereof; we require that |= be reflexive (Γ |= Γ) and transitive (Γ1 |= Γ2 
and Γ2 |= Γ3 implies Γ1 |= Γ3). We write Γ |= □ to denote that Γ is inconsistent. We 
can now formulate a precise definition of the learning from examples problem 
([Wro96]). 

Definition 1 ILP prediction learning problem. For a background knowledge 
language L_B, an example language L_E, a hypothesis language L_H, and 
an entailment relationship |=, we call the following problem the ILP prediction 
learning problem ILP_P(L_B, L_E, L_H, |=): 

Given: 

- background knowledge B expressed in the background knowledge language 
L_B 

- positive examples E+ expressed in the example language L_E 

- negative examples E- expressed in the example language L_E 

- such that B is consistent with E+ and E- (B ∪ E+ ∪ E- ⊭ □) 

Find: 

- a learning hypothesis H expressed in the hypothesis language L_H 

- such that H is complete, i.e., together with B entails (“covers”) the positive 
examples (H ∪ B |= E+) 

- H is correct, i.e., is consistent with the negative examples (H ∪ B ∪ E- ⊭ □) 

- and H meets a particular preference bias. 

^1 Some learning systems allow the user to specify that all unstated examples are 
negative, in which case negative examples do not need to be specified explicitly. 

The solutions to the above learning problem are by no means unique; for 
many problems, there will be an infinite number of solutions. Since we require 
B ∪ E+ ∪ E- ⊭ □, there is also a trivial solution, namely H := E+ (assuming 
L_E ⊆ L_H). It is the role of the preference bias to exclude unwanted hypotheses 
such as the trivial one. A common preference bias is to require H to be the most 
general complete and correct hypothesis. 

Definition 2 Generality. A hypothesis H1 is said to be more general than a 
hypothesis H2, written H1 ≥_g H2, iff 

H1 |= H2. 






Also, in most practical applications, requiring completeness and consistency is 
inappropriate since usually, examples and/or background knowledge are noisy. 
Practical algorithms therefore use statistical criteria instead of strict consistency 
and completeness to evaluate hypotheses. 



3 Basic complexity issues in ILP 

The computational difficulty of the problem defined above depends on exactly 
how its primary components are instantiated, i.e., which languages are allowed 
for the specification of examples, background knowledge, and hypothesis, and 
which notion of explanation is used. In the most general case, full first-order 
logic could be used as a language, and as stated above, explanation would be 
identified with general logical entailment. Due to the undecidability of first-order 
logic, in this case it is not even decidable whether a hypothesis is a solution to the 
learning problem. One of the central goals of ILP is thus to find restrictions of 
the various components of the learning problem that on the one hand make the 
problem easier while on the other still allow interesting concepts to be learned. 
3.1 Choice of the language 

On the language side, all ILP systems in use today employ clausal first-order logic 
as their representation, usually concentrating on Horn clauses, i.e., clauses with 
one positive literal (head). Entailment in these languages is still undecidable, so 
many systems in addition assume function-free languages, which are decidable 
and have turned out to be adequate for many practical problems. Furthermore, 
it is possible to automatically transform a clausal representation with function 
symbols into a function-free representation using a technique known as flattening 
[RP89]. However, the answer set of the transformed function-free program is 
equivalent to the original program with functions only under certain conditions 
[Sta94]. 

In most practical settings, one is interested in predicting membership in a 
single target predicate, so most systems concentrate on learning clauses for a 
single predicate only. Multi-clause hypotheses are typically learned by a covering 
approach, where the system first induces a single clause to cover one part of 
the positive examples, and then further clauses to cover the remaining examples. 
In the following, we will concentrate on the task of learning single most general 
clauses, keeping in mind that multiple-clause, multiple-predicate hypotheses 
introduce extra complexity (see [RLD93] for multiple-clause, multiple-predicate 
learning). 
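
The covering approach mentioned above can be sketched as a simple loop. This is a minimal Python illustration under stated assumptions: clauses are represented only by the set of examples they cover, and the single-clause learner pick_best merely chooses from a fixed, hypothetical candidate pool rather than searching a hypothesis space.

```python
def covering(positives, learn_one_clause):
    """Sequential covering: add clauses until no further positives get covered."""
    remaining = set(positives)
    hypothesis = []
    while remaining:
        clause, covered = learn_one_clause(remaining)
        newly = covered & remaining
        if not newly:          # no progress: stop instead of looping forever
            break
        hypothesis.append(clause)
        remaining -= newly     # later clauses only need to cover what is left
    return hypothesis, remaining


# Hypothetical candidate clauses, each described only by the examples it covers.
candidates = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}


def pick_best(remaining):
    """Stand-in for a single-clause learner: take the candidate covering most."""
    name = max(candidates, key=lambda n: len(candidates[n] & remaining))
    return name, candidates[name]


hypothesis, uncovered = covering({1, 2, 3, 4, 5, 6}, pick_best)
print(hypothesis, uncovered)
```

In this toy run, example 6 is covered by no candidate and therefore remains uncovered when the loop stops.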



3.2 Replacing |= by θ-subsumption 

Even for a Horn clause program consisting of one ground fact, one ground query 
and two Horn clauses with two literals, the ILP problem defined above remains 
undecidable (see [KD94] for an overview). Therefore, almost all approaches in 
ILP today have chosen to replace (semantic) logical entailment (|=) by 
(syntactic) θ-subsumption (≥_θ), defined by [Rob65] and first used for learning by 
Plotkin [Plo70]. 

θ-subsumption. Let c and c' be two program clauses (in set notation). Clause c 
θ-subsumes c' (c ≥_θ c') if there exists a substitution θ such that cθ ⊆ c'. Two 
clauses c and c' are θ-subsumption equivalent (c =_θ c') if c ≥_θ c' and c' ≥_θ c. A 
clause is reduced if it is not θ-subsumption equivalent to any proper subset 
of itself. 

An important property of θ-subsumption is that if c ≥_θ c', then c |= c'. The 
converse is not true, as shown by the following examples. 

c1 := parent(Y,X) → daughter(X,Y) 
c2 := parent(ann,mary) → daughter(mary,ann) 
c3 := female(mary), parent(ann,mary), parent(tom,mary) → daughter(mary,ann) 
c1 ≥_θ c2 and c1 ≥_θ c3, both with θ = {X/mary, Y/ann}. 

c4 := parent(Y,X), parent(U,V) → daughter(X,Y) 
c1 =_θ c4; c1 is reduced, c4 is not reduced. 

c5 := q(X,Y,Z) → q(Y,Z,X), c6 := q(X,Y,Z) → q(Z,X,Y) 
c5 ≱_θ c6 and c6 ≱_θ c5, but c5 |= c6 and c6 |= c5. 

Table 1. θ-subsumption examples. 

As illustrated by the last example, the incompleteness of θ-subsumption is 
due to clauses that can be used recursively on themselves. If these so-called 
self-recursive clauses are not allowed, θ-subsumption is complete with respect to |= 
[Got87]. While there is current work in ILP on inverting implication (see e.g. 
the Progol system [Mug95]), most other algorithms currently in use are based 
on θ-subsumption. Checking θ-subsumption is an NP-complete problem [GJ79, 
p. 264], but in many cases it is possible to exploit structural properties of the 
data (determinacy and locality) to greatly reduce the combinatorial matching 
problem and runtime when compared to a blind matching approach [KL94]. 
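
The "blind matching approach" can be illustrated with a small backtracking test for θ-subsumption, run on the clauses of Table 1. This is a sketch under stated assumptions: the literal encoding (sign, predicate, arguments), the helper rule, and the function names are illustrative, and the clauses are function-free. It is the exponential baseline, not the optimized matching of [KL94].

```python
def is_var(term):
    """Prolog convention: variables start with an uppercase letter."""
    return term[:1].isupper()


def match(lit, target, theta):
    """Extend theta so that lit under theta equals target; None if impossible."""
    if lit[0] != target[0] or lit[1] != target[1] or len(lit[2]) != len(target[2]):
        return None
    theta = dict(theta)
    for a, b in zip(lit[2], target[2]):
        if is_var(a):
            if theta.setdefault(a, b) != b:   # variable already bound differently
                return None
        elif a != b:                          # constants must be identical
            return None
    return theta


def subsumes(c, d, theta=None):
    """Return theta with c·theta a subset of d, or None (blind backtracking).

    Terms of d are treated as fixed symbols, even if they are variables.
    """
    theta = {} if theta is None else theta
    if not c:
        return theta
    for target in d:
        extended = match(c[0], target, theta)
        if extended is not None:
            result = subsumes(c[1:], d, extended)
            if result is not None:
                return result
    return None


def rule(head, *body):
    """Clause in set notation: positive head literal, negative body literals."""
    return [("+",) + head] + [("-",) + b for b in body]


c1 = rule(("daughter", ("X", "Y")), ("parent", ("Y", "X")))
c2 = rule(("daughter", ("mary", "ann")), ("parent", ("ann", "mary")))
c3 = rule(("daughter", ("mary", "ann")), ("female", ("mary",)),
          ("parent", ("ann", "mary")), ("parent", ("tom", "mary")))
c4 = rule(("daughter", ("X", "Y")), ("parent", ("Y", "X")), ("parent", ("U", "V")))
c5 = rule(("q", ("Y", "Z", "X")), ("q", ("X", "Y", "Z")))
c6 = rule(("q", ("Z", "X", "Y")), ("q", ("X", "Y", "Z")))

print(subsumes(c1, c2))                   # substitution for X and Y
print(subsumes(c5, c6), subsumes(c6, c5)) # neither subsumes the other
```

Each literal of c may be tried against every literal of c', which is where the combinatorial cost comes from; determinacy reduces the number of candidate targets per literal.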

Even though θ-subsumption is defined on pairs of clauses, it can still be used 
with background knowledge B. Assume we are given a set of atoms K such that 
B |= K. We can then transform the ILP problem specification as follows. Given 
unary clause examples E+ and E-, we are looking for a hypothesis H such that: 

   B ∪ H |= E+                ∧  B ∪ H ∪ E- ⊭ □ 
⇔  H |= B → E+                ∧  H ⊭ B ∪ E- → □ 
⇐  H |= K → E+                ∧  H ⊭ K ∪ E- → □ 
⇐  H ≥_θ K → e+, ∀e+ ∈ E+     ∧  H ≱_θ K → e-, ∀e- ∈ E- 

Just like the replacement of implication with θ-subsumption, this transformation 
causes the learner to miss some complete hypotheses and may cause it to output 
incorrect hypotheses (with respect to the original problem statement using 
implication). If B is not a set of atoms, K is usually taken to be some finite subset 
of all atomic statements entailed by B, computed by a limited-depth deduction 
process (e.g. saturation [RP89]). 



3.3 Towards polynomial learnability 

While the above restrictions ensure a decidable learning problem, they are 
insufficient to reach polynomial learnability. Following the popularity of Valiant’s 
probably approximately correct (PAC) learning framework in the propositional 
learning area, significant effort was invested in ILP to examine whether 
polynomially PAC-learnable ILP learning tasks could be found. Indeed, several authors 
were able to prove positive results for restricted variants of the above problem. 
Here are some examples, all of which assume (as explained above) that the goal 
is to learn one or more clauses about a single target predicate, and that the 
available background knowledge allows any atomic query to be proved in time 
polynomial in the length of the atom^2. 

- function-free non-recursive clauses of at most k literals each are polynomial 
time PAC learnable if either all body variables occur in the head of the 
respective clause (constrained clauses) [DMR92], or background knowledge 
consists only of function-free ground atoms [Coh93]. 

- sets of at most k nonrecursive clauses are polynomial time PAC predictable if 
they are i,j-determinate [Dvz95]. This roughly means that the instantiations 
of head variables in the clause uniquely determine (through a chain of at 
most i shared variables) the instantiations of all body literals with respect 
to the available examples and background knowledge. If i is not restricted, 
the problem is no longer polynomial time PAC-learnable, nor is it if we 
allow 1,2-non-determinate clauses (if RP ≠ PSPACE, [Kie93]). 

^2 See [KD94] or [NCW97, ch. 18] for detailed overviews of PAC-learnability and ILP. 

Unfortunately, in order to obtain these and other positive PAC-learnability 
results, the learning problem had to be restricted quite severely. For practical 
applications, it is not reasonable to assume learned clauses contain no free body 
variables, nor is it always possible to provide a small limit on the number k 
of necessary literals in the body of learned clauses. Similarly, the property of 
i,j-determinacy assumes that there is only one way to match a clause against 
examples and background knowledge, which is not the case in many ILP 
applications. In fact, an i,j-determinate learning problem can be mapped onto a 
propositional learning problem that is only polynomially larger [KD94]. This 
means that the polynomial time PAC-learnability of i,j-determinate clauses is 
inherited from the propositional domain, and the problem has essentially been 
restricted to a propositional problem. 

Nonetheless, even in this case it is a tremendous advantage in terms of 
usability and understandability to be able to use first-order representations: even 
though a propositional equivalent exists, it would be very cumbersome to 
construct manually. Furthermore, in problems that are not entirely determinate, 
it is very advantageous to construct the representation such that as many 
literals as possible are determinate, thus reducing the matching effort required. 
Further gains can be made by moving tree-structured determinate relationships 
into first-order terms, which are handled efficiently by ILP algorithms (simply 
through unification or using special algorithms for terms such as RIBL’s tree 
edit distances [BHW98]). 

4 Classic ingredients of scalability 

Beyond theoretical learnability considerations, ILP has always constructed 
learning systems that, despite prohibitive theoretical results, have performed well and 
with acceptable runtimes in practice. This is not primarily a critique of 
inadequate learnability models, even though some authors have started to propose 
alternative learnability models like U-learnability, which focuses on instance 
distributions with certain benign properties and allows more positive results to 
be proved [MP94]. Instead, it is the result of several practical ingredients 
commonly found in ILP algorithms: declarative bias, search ordering and pruning, 
and search heuristics. 
4.1 Declarative bias 

Declarative bias means that the learning system allows the user to specify a 
different hypothesis space for each actual instance of an ILP learning problem. 
Within the boundaries of high-level restrictions such as given above (e.g. 
restriction to function-free Horn clauses), the user can identify a subspace that is 
deemed appropriate to the current learning problem instance. Just how the user 
is to communicate his or her knowledge about the desired subspace has been 
an intensively studied subject. One can generally distinguish several kinds of 
approaches of increasing complexity: 

Type and mode declarations. These are the simplest form of declarative bias. 
A type declaration specifies, for each predicate and each of its argument 
places, which type (sort) of value is allowed. The learning system can use 
this information to avoid identifying variables with incompatible types, thus 
ruling out many useless clauses from consideration. An argument’s mode 
declaration specifies whether the argument is an input or output argument. 
Clauses are then constructed to make sure that all input arguments of a 
literal are bound by preceding literals, again ruling out many uninteresting 
clauses. Sometimes, mode declarations are also used to specify how many 
instantiations of output arguments are possible given instantiated input 
arguments; this can be used to further optimize the search. Practically all ILP 
systems use type and/or mode declarations; the currently popular form of 
type and mode declarations was introduced for the Progol system [Mug95]. 

Schemata and templates. Schemata [KW92] and templates [WO91] are clauses 
with designated places into which actual predicates and/or arguments can 
be inserted. Since the syntactic form of a schema or template is more or less 
fixed, the user can easily see which hypotheses will be allowed by a schema, 
and exert very fine-grained control over the hypothesis space. For small 
target hypothesis spaces and well-understood learning targets, this is a good 
approach. Many problems, however, require overly large sets of templates, so that a 
generative declarative bias language is preferable. 

Generative approaches. Generative approaches to declarative bias consist of 
programs or grammars that compute the allowed hypotheses, i.e., there does 
not need to be a direct correspondence between the syntax of the bias 
expression and the allowed hypotheses. A well-developed generative bias language 
is DLAB [DRD97], which allows both simple template-style specifications 
and complex generative expressions. 
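
As an illustration of the simplest kind of bias, mode declarations, the following Python sketch checks that every input argument of a body literal is bound by the head or by a preceding literal. The mode table, the "+"/"-" encoding and the function name mode_conform are hypothetical and chosen only for this example; they are not the declaration syntax of Progol or any other system.

```python
# Hypothetical mode table: '+' marks an input argument, '-' an output argument.
modes = {
    "works_for": "+-",
    "operates": "-+",
    "subsidiary": "++",
    "technical": "+",
}


def mode_conform(head_vars, body, modes):
    """True if all input arguments are bound when their literal is reached."""
    bound = set(head_vars)          # head variables are bound by the query
    for pred, args in body:
        for arg, mode in zip(args, modes[pred]):
            if mode == "+" and arg not in bound:
                return False
        bound.update(args)          # outputs become available to later literals
    return True


# Body of the may_operate rule from section 2, in the order given there.
body = [
    ("works_for", ("P", "C1")),
    ("operates", ("C2", "S")),
    ("subsidiary", ("C1", "C2")),
    ("technical", ("P",)),
]
ok = mode_conform({"P", "S"}, body, modes)

# Moving subsidiary(C1,C2) to the front uses C1 and C2 before they are bound.
reordered = [body[2], body[0], body[1], body[3]]
bad = mode_conform({"P", "S"}, reordered, modes)
print(ok, bad)
```

A learner that generates clauses literal by literal can apply such a check at every step and discard non-conforming candidates before evaluating them on any example.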

4.2 Search ordering 

Within a given hypothesis space (with or without declarative bias), efficiency 
can be gained by exploiting the logical generality structure of the hypothesis 
space and using appropriate refinement operators. 

Definition 3 Refinement. A (downward) refinement operator ρ is a mapping 

ρ : L_H → 2^(L_H) 

such that for all H' ∈ ρ(H): H ≥_g H'. 

Hypothesis search then proceeds from the most general hypothesis to more 
special hypotheses. If a hypothesis is found to contradict (too many) negative 
examples, search continues with its refinements. For this to be effective, the refinement 
operator should map each hypothesis to a finite set of refinements, and it should 
produce every hypothesis in the hypothesis space in finitely many steps when 
starting from the most general hypothesis^3. This structuring of the search allows 
the basic forms of pruning. Whenever a hypothesis is found to exclude all (or 
enough) negative examples, we need not consider any of its refinements. 
Similarly, whenever a hypothesis is found to cover too few positive examples, it and 
all its refinements can be excluded from consideration. 

Further optimizations are possible if an optimal refinement operator is used. 
A refinement operator is optimal if each hypothesis is produced along exactly 
one refinement path. Whenever hypotheses are produced along several paths, we 
need to keep track of visited hypotheses to avoid reexploring parts of the space 
that have already been visited. For optimal refinement operators, this is 
unnecessary, which is especially important for parallelization of search algorithms. 
Optimal refinement operators are used e.g. for hypothesis spaces specified 
using the declarative bias language DLAB [DRD97] and for the foreign link bias 
language of Midos ([Wro97], see below). 

4.3 Search heuristics 

Search heuristics can be used to either select a particular order of examining the 
search space (so that the best hypotheses are found first), or, more commonly, 
to greedily eliminate parts of the hypothesis space from consideration. Many of 
these criteria are based on the idea that when refining, it is best to continue 
with those refinements that best separate positive from negative examples. This 
idea can be mapped to various statistical criteria; one of the simplest is the 
information gain measure used in FOIL [Qui90]. If we let 

e+(H) := |{e+ ∈ E+ | H covers e+}| 

and 

e(H) := |{e ∈ E+ ∪ E- | H covers e}| 

the information gain of a refinement hypothesis H' with respect to the original 
hypothesis H can be written as: 

IG(H, H') := e+(H') · (−log2(e+(H) / e(H)) + log2(e+(H') / e(H'))) 

i.e., FOIL uses only the information content of positive tuples (probabilities 
approximated as relative frequencies) and weights the information gain so that 
refinements which cover many positive examples are preferred^4. 
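
The weighting effect can be seen in a small numerical sketch of the gain formula above. The function name info_gain and the example counts are illustrative assumptions; the function takes the four coverage counts directly and assumes that H covers at least one positive example.

```python
from math import log2


def info_gain(pos_old, all_old, pos_new, all_new):
    """IG(H, H') = e+(H') * (log2(e+(H')/e(H')) - log2(e+(H)/e(H))).

    pos_old, all_old: positives and all examples covered by H (pos_old > 0).
    pos_new, all_new: positives and all examples covered by the refinement H'.
    """
    if pos_new == 0:
        return 0.0                      # a refinement covering no positives gains nothing
    return pos_new * (log2(pos_new / all_new) - log2(pos_old / all_old))


# H covers 10 positives out of 20 examples.
broad = info_gain(10, 20, 8, 10)        # H' keeps 8 positives, 2 negatives
narrow = info_gain(10, 20, 2, 2)        # H'' is pure but keeps only 2 positives
print(broad, narrow)
```

Although the second refinement excludes all negative examples, the first one scores higher because the gain is multiplied by the number of positive examples still covered.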

^3 Ideally, each refinement should also be strictly less general than its predecessor to 
ensure the search does not get stuck. For full clausal languages, this cannot be 
achieved, however, without giving up one of the first two desired properties [NCW97]. 
^4 The actual computation in FOIL is more complex. 
5 Recent advances in practical scalability 

In addition to the classic scalability ingredients described above, recently a 
number of alternative approaches and ideas have successfully been introduced into 
ILP systems. First, the use of alternative formulations of the ILP learning 
problem has allowed parallelizable algorithms and the use of several smaller 
optimizations already known in propositional learning. Second, the introduction of 
sampling techniques has allowed significant speedups while still maintaining 
asymptotically perfect accuracies^5. 



5.1 Alternative ILP problem definitions 

Prediction is only one of the possible interesting learning tasks. In many situations,
it can be more appropriate to look for hypotheses that describe the existing
data instead of being optimized for predicting new data. With the growing popularity
of knowledge discovery, such analysis tasks have also become more popular
in ILP, and several authors have proposed ILP problem definitions that can be
seen as variants of a descriptive learning task (e.g. [Hel89, Fla92, DRD94]).
These definitions can be generalized into the following task definition of descriptive
learning in ILP, phrased using a generic description relationship $\models_D$ between
datasets and hypotheses [WD95]®.

Definition 4 ILP description problem. For a background knowledge language
$L_B$, a dataset language $L_D$, a hypothesis language $L_H$, an entailment relationship
$\models$, and a description relationship $\models_D$, we call the following problem the
ILP description learning problem $ILP_D(L_B, L_D, L_H, \models, \models_D)$:

Given:

— background knowledge $B$ expressed in the background knowledge language $L_B$

— data $D$ expressed in the dataset language $L_D$

— such that $B$ is consistent with $D$ ($B \cup D \not\models \Box$)

Find:

— a learning hypothesis $H$ expressed in the hypothesis language $L_H$

— such that $H$ describes $B$ and the data $D$ ($(B, D) \models_D H$).

Often, we also require that

— $H$ is complete, i.e., for any other descriptive hypothesis $H'$ in $L_H$, $H \models H'$,

— and $H$ is non-redundant, i.e., for any proper subset $H' \subset H$, $H' \not\models H$.

® We do not describe preprocessing techniques like discretization or attribute selection 
here, since these are general techniques external to ILP algorithms. Nonetheless, 
their use is of course essential to scalability. 

® We have omitted the use of uninteresting datasets (“negative examples”), since they 
are not important for our purposes here.

As described in [WD95], the properties of this learning problem depend not
only on $L_B$, $L_D$, $L_H$ and $\models$, but crucially on the choice of $\models_D$, the description
relationship. One important interpretation is the one made in the so-called learning
from interpretations or non-monotonic ILP setting [DRD97]. In this setting,
it is assumed that $L_B$ and $L_D$ are such that a unique minimal Herbrand model
can be assumed to exist. Then, we can define for a hypothesis $H$ to describe a
set of data $D$:



$$(B, D) \models_{\mathcal{M}^{+}} H \;:=\; H \text{ is true in } \mathcal{M}^{+}(D \cup B), \text{ the minimal Herbrand model of } D \cup B.$$

Taking the minimal Herbrand model amounts to making the closed-world assumption,
since anything not explicitly stated or inferable is assumed to be
false. This description relationship is thus appropriate especially when we can
safely assume that we have complete descriptions of the objects under consideration.

From the point of view of scalability, the above is important since it entails 
the following property: 

If $(B, D) \models_{\mathcal{M}^{+}} H_1$ and $(B, D) \models_{\mathcal{M}^{+}} H_2$, then $(B, D) \models_{\mathcal{M}^{+}} H_1 \cup H_2$
(compositionality or monotonicity).

This means that, in contrast to the regular ILP prediction problem, it is safe
to consider each hypothesis clause separately and then join them together at
the end. In the regular setting, a correct search for multi-clause hypotheses
is so complex that it is usually avoided and replaced by a covering approach
[RLD93]; in the nonmonotonic setting, these problems do not arise. Practically
speaking, monotonicity means that parallel implementations of the search are
easily possible by assigning different subareas of the hypothesis space to different
processors. If an optimal refinement operator is used, this can be done with very
little communication, and indeed the parallel version of Claudien, a popular
learner for the nonmonotonic setting, has shown a speedup almost linear in the
number of processors (for up to 16 processors, see [DRD97]).

The nonmonotonic setting can also be seen in terms of PAC-learnability
if $D$ is understood as a collection of independent subsets. This quite often is a
reasonable assumption, since the data usually will be describing different objects
or cases, and each data subset can collect all the information about one such
object or case. The nonmonotonic description relationship is then interpreted in
a local fashion:

$$(B, D) \models_{\mathcal{M}^{+}} H \;:=\; \text{for all } d \in D,\ H \text{ is true in } \mathcal{M}^{+}(d \cup B).$$

The minimal model of each data set is an interpretation (a particular subset of
the Herbrand universe), and the set of all such interpretations can be regarded
as an instance space within which the given datasets are the positive examples.
In this view, it has been possible to prove that first-order clausal theories are
polynomially PAC-learnable (with one-sided error from positive examples only)
if they consist of clauses that are allowed (all head variables occur in the body)
and have no more than $k$ literals of size at most $j$ each [DRD94].

Practically, the nonmonotonic interpretation of datasets brings about a very
close relationship to propositional learning, where we are also assuming complete
knowledge about the observed objects. This relationship has allowed a number
of important propositional learning algorithms to be upgraded to the first-order
case, most notably decision tree learning, where the first-order decision tree
learner Tilde [BDR98] has proven to combine good learning accuracies with
excellent scalability.

5.2 Sampling 

All of the approaches to scalability described above are aimed at reducing the
number of hypotheses that need to be considered, or at allowing parallel and independent
search for these hypotheses. Sampling approaches to scalability tackle
the orthogonal problem of reducing the time it takes to check each individual
hypothesis on the data. In other words, the former approaches deal with computational
challenges introduced by the complexity of the data, whereas sampling
deals with challenges introduced by the amount of data.

We can currently distinguish three ways in which sampling enters into ILP
algorithms, namely global sampling approaches, local sampling approaches and
sampling of substitutions.

Global sampling Global sampling in its simple form is probably the most popular
and most heavily criticized data preprocessing technique in KDD: if the entire
dataset cannot be handled by the chosen analysis method, we select a small
random sample and compute the analysis result on the sample. Clearly, this is
problematic since there is no guarantee that results on the sample are the same
as results on the entire dataset, not only because of random fluctuations in the
sample, but also because of heuristics in the analysis method that may enlarge
such fluctuations.

For predictive learning tasks, these problems can be somewhat alleviated by
approaches known as windowing [Qui93]. Windowing selects an initial subset of
the examples, learns on these examples, and then checks the results on all examples.
The algorithm then constructs a new sample of training examples, adding
in a sample from those examples that were incorrectly handled by the previous
learning result. Srinivasan [Sri97] has adapted this approach for ILP in a method
called logical windowing. This approach differs from the basic scheme described
above by using individual clause deletions to arrive at the final hypothesis (a kind
of covering strategy) instead of relearning an entire theory on each subsequent
sample.

An alternative to windowing is layered learning [Mug93], which is possible if
the hypothesis language allows the formulation of a hierarchy of theories where
each lower-level theory handles only the exceptions of the higher-level theory.
In ILP, this is possible through the use of non-monotonically interpreted exception
predicates that are attached to clauses (see e.g. [BM92, Wro94]). Layered
learning then proceeds by first taking a small sample of data, constructing an
initial theory, and then at each iteration constructing the next theory, training
only on the errors of the first in a new sample that is a superset of the first.
In this case, assuming we know the size of the concept class ($2^n$), it is possible,
based on lower-bound PAC results, to derive optimal sizes for the successive
samples to achieve exponentially decreasing error with linearly growing training
sets [Mug93].

In [Sri97], the results of extensive experiments with both logical windowing
and layered learning on two domains (king-rook-king end games and part-of-speech
tagging) are reported. The experiments compared the runtimes and
accuracies of the ILP learner Progol [Mug95] with the layered-learning variant
of Progol and with a variant using a logical windowing wrapper around
Progol. The experiments indicate that in practice, logical windowing seems
to work better, delivering slightly better accuracies and shorter runtimes than
layered learning. Compared to the nonsampling version of Progol, runtimes were
shorter by almost an order of magnitude, while accuracies remained comparable.
These results are important especially since logical windowing can always be
employed in addition to other optimizations like local or substitution sampling.

Local sampling Local sampling [Wro97] differs from global sampling in that
sampling is done with respect to the hypothesis that is currently being considered.
Whereas global sampling uses one sample for all hypotheses considered in
a step, and precisely computes accuracies on the sample, local sampling uses a
different-size sample for each hypothesis and selects precisely the sample size
that is necessary to achieve the desired accuracy in estimating the hypothesis’
properties. If, as in Midos [Wro97], a relative frequency is to be estimated, a
(generous) upper bound on the required sample size can be computed based on
elementary results from statistics. Estimating a probability amounts to repeatedly
drawing samples and checking whether they have the required property,
e.g. whether the current hypothesis covers the example. Repeating this experiment
means we are getting a binomial distribution with underlying probability
$p$, which is the probability we are interested in. We estimate this probability by
$p' := x/s$, where $x$ is the number of “successes” after drawing $s$ samples.

We then need a statement about the probability that $|p - p'|$ is smaller than
a specified error threshold $\epsilon$. From the statistics literature, we can use the so-called
Chernoff bound ([AS92] foll. [AMS+96]) for this purpose. According to
this bound, for any $a > 0$,

$$P(x > sp + a) < e^{-2a^2/s}.$$

For the difference between estimated and actual probability, we obtain

$$P(p' > p + \epsilon) = P(x > sp + s\epsilon) < e^{-2(s\epsilon)^2/s} = e^{-2s\epsilon^2}.$$

Thus, for truly random samples, we can precisely determine which sample size
is needed to stay below a desired error probability $\delta$. In section 6 below, we will
be discussing the use of this kind of sampling in an ILP subgroup discovery
algorithm.
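To make the sample size computation concrete, here is a minimal Python sketch. It assumes the additive form of the Chernoff bound, $P(p' > p + \epsilon) < e^{-2s\epsilon^2}$, and solves $e^{-2s\epsilon^2} \le \delta$ for $s$, which gives $s \ge \ln(1/\delta) / (2\epsilon^2)$. The function name `sample_size` is hypothetical and not taken from Midos.

```python
import math


def sample_size(eps, delta):
    """Smallest s with exp(-2 * s * eps**2) <= delta (one-sided bound).

    eps   -- tolerated error of the estimated relative frequency
    delta -- tolerated probability of exceeding that error
    """
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 / delta) / (2.0 * eps ** 2))
```

Under this assumption, an error of 0.05 with error probability 0.05 needs 600 samples. The required size does not depend on the size of the dataset, which is what makes local sampling attractive for very large relations.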
Sampling of substitutions Sampling of substitutions (“stochastic matching”,
[SR97]) is a recently introduced technique that is even more fine-grained than
local sampling, in that it is concerned with the process of checking a hypothesis
on one particular example, i.e., with the task of checking $\theta$-subsumption. As
defined above, $\theta$-subsumption is a combinatorially expensive process, since in
order to find the matching subsets of clauses, all pairs of literals with the same
predicate symbol must be checked. In determinate domains, this is not a problem
since there will only be one such match for each example; in highly nondeterminate
domains, this can be the major problem dominating the runtime of an ILP
method. Stochastic matching was introduced for one such application (mutagenesis)
in which examples contain up to 40 literals for one predicate, resulting in
$40^k$ possible matches for clauses with $k$ such literals.

Stochastic matching simply means to consider only a fixed number of the
possible matches between hypothesis literals and example literals. Clearly, as
more and more matches are considered, in the limit stochastic matching approaches
the standard definition of $\theta$-subsumption. Since appropriate samples
can also be constructed in polynomial time (in the number of literals and variables
in the participating clause), both hypothesis construction and checking can
be done in polynomial time. The stochastic matching approach was first realized
and experimentally examined in the ILP learning system STILL [SR97]. The
reported experimental results proved very encouraging. With sample sizes of 300
for hypothesis construction and 3 for hypothesis checking, STILL achieved accuracies
comparable to those of non-sampling ILP learners while reaching runtimes
that were two to three orders of magnitude faster. While this extreme speedup
is due to the strong nondeterminacy of the domain, smaller speedups can be
expected in most domains. In addition, stochastic matching can be combined
with example-based sampling approaches (however, their interactions have not
been examined).
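The idea of checking only a fixed number of candidate matches can be sketched as follows. This is a simplified illustration, not the algorithm implemented in STILL: the literal representation (a predicate name with a tuple of arguments), the Prolog-style convention that names starting with an upper-case letter are variables, and the function names are all assumptions of this example.

```python
import random


def try_match(hypothesis, example, rng):
    """Draw one random literal-to-literal assignment.

    Returns the resulting substitution, or None if this draw is inconsistent.
    """
    theta = {}
    for pred, args in hypothesis:
        candidates = [lit for lit in example
                      if lit[0] == pred and len(lit[1]) == len(args)]
        if not candidates:
            return None
        _, consts = rng.choice(candidates)
        for arg, const in zip(args, consts):
            if arg[0].isupper():                      # variable
                if theta.setdefault(arg, const) != const:
                    return None                       # conflicting binding
            elif arg != const:                        # constant mismatch
                return None
    return theta


def stochastic_subsumes(hypothesis, example, samples, seed=0):
    """Approximate subsumption test that inspects at most `samples` matches."""
    rng = random.Random(seed)
    return any(try_match(hypothesis, example, rng) is not None
               for _ in range(samples))
```

The test is one-sided: a positive answer always comes with a real substitution, whereas a negative answer may simply mean that none of the sampled matches happened to succeed. Increasing `samples` moves the test towards exact $\theta$-subsumption at a higher cost.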

6 Ingredients in context: Midos 

We conclude by presenting a little more detail from our recent work on Midos,
an ILP subgroup discovery system that combines several of the scalability ingredients
introduced above: a novel task definition based on the descriptive learning
problem, declarative bias, optimal refinement, advanced pruning based on
minimal support and optimistic estimates, and local sampling for estimating
frequencies.



6.1 The subgroup discovery task 

The subgroup discovery task is a variant of the descriptive learning task defined
above. We are interested in finding hypotheses that describe interesting subgroups
of the population, where interestingness is interpreted as a combination of
large size (generality) and distributional unusualness. As a typical example, in a
medical application we are looking at, an interesting subgroup that we would
like to discover could be “the subgroup of patients who once had a treatment in
a small hospital are significantly more likely to suffer from complications than
the reference population”. For another example, here is a sample subgroup discovery
result from a business application (identifying unusual club membership
distributions)^.

Target Type is: nominal([member, non_member])

Reference Distribution is: [66.1%, 33.9% - 1371 objects]

Sex=female, ID=order.Customer ID, order.Paymt Mode=credit_card
[69.9%, 30.1% - 478 objects] [1.53882%%]

We see that the entire population consisted of 1371 objects (i.e., customers), of
which 66.1% are club members. In contrast, in the subgroup of female credit
card buyers (478 customers), 69.9% are club members. This finding is assigned
a quality value of 1.53882%% by Midos.

The requirements of subgroup discovery can be seen as soft variants of the
corresponding requirements in descriptive learning, where the description relationship
is defined with respect to distributional unusualness, and the generality
requirement translates into a preference for larger discovered groups. The setting
considered makes one further assumption that is important for scalability,
namely, that the k best such groups are to be discovered. The multi-relational
subgroup discovery task is more precisely defined as follows.

Definition 5 Multi-relational subgroup discovery. Given:

— a relational database $D$ with relations $R = \{r_1, \ldots, r_n\}$

— a hypothesis language $L_H$ (language of group descriptions)

— an evaluation function $d : h \in L_H, D \to [0, 1]$

— an integer $k > 0$

Find:

— a set $H \subseteq L_H$ of hypotheses of size at most $k$,

— such that for each $h \in H$, $d(h, D) > 0$

— and for any $h' \in L_H \setminus H$, $d(h', D) \le \min_{h \in H} d(h, D)$.

The evaluation function $d$ is defined as follows, based on the evaluation measures
that have been defined for propositional algorithms [Klö96].

Assume we are given a designated object relation $r_0$ with key attributes $K$
that is part of a database $D$ to be examined. For the simplest case, a binary goal
attribute $A_g$ in $r_0$, let $T := \{t \in r_0 \mid t[A_g] = 1\}$ denote the set of target object
tuples, define $g(h) := |c(h)| / |r_0|$, the generality of a hypothesis, and probabilities
$p_0 := |T| / |r_0|$ and $p(h) := |c(h) \cap T| / |c(h)|$. The chosen evaluation function is defined as:

Only one group is shown. The printout does not use standard logical syntax, since
the arity of predicates can be in the dozens or hundreds, and only very few arguments
are typically different from

$$d(h) := \begin{cases} 0 & \text{if } |c(h)| / |r_0| < s_{min} \\ \dfrac{g(h)}{1 - g(h)} \, (p(h) - p_0)^2 & \text{otherwise} \end{cases}$$

Here, $s_{min}$ is a user-provided minimal group size, an assumption commonly made
for data mining applications (“minimal support” in association rule discovery,
[AMS+96]).
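The evaluation function for the binary case can be sketched in a few lines of Python. The sketch assumes the reading $d(h) = \frac{g(h)}{1 - g(h)} (p(h) - p_0)^2$ with $g(h) = |c(h)| / |r_0|$, and it returns 0 for groups below the minimal support. The function and parameter names are hypothetical, and treating an empty group or a group covering the whole relation as quality 0 is a convention added here to avoid division by zero.

```python
def quality(n_cover, n_cover_target, n_objects, n_target, s_min):
    """Quality d(h) of a subgroup for a binary goal attribute.

    n_cover        -- |c(h)|, objects covered by the hypothesis
    n_cover_target -- covered objects that are target objects
    n_objects      -- |r_0|, size of the object relation
    n_target       -- |T|, target objects in the whole relation
    s_min          -- minimal support, as a fraction of |r_0|
    """
    g = n_cover / n_objects                      # generality g(h)
    if n_cover == 0 or g < s_min or g >= 1.0:
        return 0.0                               # assumed convention, see lead-in
    p0 = n_target / n_objects                    # reference probability
    p = n_cover_target / n_cover                 # probability within the group
    return g / (1.0 - g) * (p - p0) ** 2
```

With 1000 objects of which 500 are targets, a group of 200 objects containing 150 targets has $g = 0.2$, $p = 0.75$ and $p_0 = 0.5$, so the factor $g / (1 - g) = 0.25$ rewards its size while the squared difference rewards its unusual distribution.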



6.2 Scalability ingredients 



Midos uses a relatively standard hypothesis space, describing each possible interesting
group by a conjunction of linked function-free literals. In addition,
the system employs a so-called foreign link bias inspired by the foreign key
links available in many database systems. We interpret these links as designated
shared-variable paths along which different relations can be joined together. Foreign
links thus are more fine-grained than type declarations, and can be used
to strongly limit the size of the hypothesis space. In addition, it is possible to
easily define an optimal refinement operator for the resulting hypothesis space,
so that the corresponding search optimizations can be made. Midos employs
a top-down search from most general to most specific groups (breadth-first or
best-first guided by the optimistic estimates described below). At any point, the
algorithm maintains the set of k best hypotheses found so far.

Pruning during this search is based on two properties. First, since groups can
only get smaller during refinement, the minimal support $s_{min}$ can be used to cut
off branches that fall below it. Second, Midos exploits the fact that the k best
existing solutions are known at any point, using a technique known as optimistic
estimate pruning. Note that the quality of considered groups can either increase
or decrease as groups get smaller, depending on just how object properties are
distributed. Thus, we cannot simply prune whenever the quality of a group is
below all k existing solutions. However, it is possible to derive from $d$ a function
$d_{max}$ that is guaranteed to be an upper bound on $d$ for all refinements of a
hypothesis. Thus, whenever this optimistic estimate is lower than the qualities
of hypotheses found so far, we can safely prune.

Finally, in order to determine $d$ on large datasets, Midos uses local sampling
as described above to determine the relevant $g$ and $p$ values with a desired error
and certainty. The primary technical problem in doing this is random sampling
from $c(h)$, which (in database terminology) is defined by a project-select-join
query for which uniform inclusion probabilities can be achieved only under very
special circumstances [Olk93]. Fortunately, since we know the set of possible
query results (the members of the object relation $r_0$), sampling can be performed
by randomly sampling this relation and then checking whether the sample has the
required properties. This is efficient for hypotheses with sufficient coverage, since
it can be expected that we do not need to sample the entire relation $r_0$ to find
sufficiently many instances of the hypothesis.
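The sampling scheme described in this paragraph amounts to drawing objects uniformly from the object relation and keeping those that the hypothesis covers. A minimal Python sketch follows, in which `object_relation` is any collection of objects, `covers` is a caller-supplied membership test standing in for the database query, and all names are assumptions made for illustration.

```python
import random


def sample_cover(object_relation, covers, n_wanted, seed=0):
    """Draw objects uniformly without replacement and keep those covered.

    Returns (hits, drawn): the covered objects found and the number of
    objects that had to be inspected to find them.
    """
    rng = random.Random(seed)
    order = list(object_relation)
    rng.shuffle(order)                 # uniform random order over r_0
    hits, drawn = [], 0
    for obj in order:
        if len(hits) >= n_wanted:
            break
        drawn += 1
        if covers(obj):                # membership test replaces the join query
            hits.append(obj)
    return hits, drawn
```

Because every object of the relation is equally likely to be inspected, the covered objects that are kept form a uniform sample of the group. The ratio of hits to draws also gives an estimate of the generality of the hypothesis, and it shows why the scheme becomes expensive for groups with very low coverage.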
7 Conclusion

We have presented a short tour and overview of the complexity and scalability
issues in Inductive Logic Programming (ILP). As explained, early work in
ILP had concentrated on fundamental complexity aspects of the ILP prediction
learning task, showing that even the commonly considered restrictions of the task
(language and subsumption relationship) were insufficient to reach polynomial
learnability, and that further restrictions that are often unrealistic are needed
for positive theoretical results. Practical systems at the same time concentrated
on scalability ingredients that are difficult to analyze theoretically: declarative
bias, ordered and pruned search, and heuristic search control.

Among the recently introduced techniques that influence scalability, the use
of descriptive learning tasks is certainly the most fundamental, both in the learning
from interpretations setting, which allows easy upgrade of powerful propositional
learning techniques as well as parallel search, and in the more KDD-like
reinterpretations like subgroup discovery. A more direct influence on scalability,
however, is due to the newly introduced sampling techniques, where orders of
magnitude improvements in running times have been achieved while maintaining
adequate accuracies. The remaining challenge lies in the theoretical
analysis of these techniques. There currently is no way of estimating the loss in
accuracy brought about by either global sampling or stochastic matching,
and even for local sampling, where in principle a bound is available, this local
bound does not translate into a bound on predictive accuracy. These could be
challenges for further theoretical developments.

Finally, one topic we have not touched upon at all in this paper is
scalability with respect to databases. Even though several ILP algorithms have
been coupled to relational databases already, optimizations specific to database
management systems have not been made. Similarly, there are no theoretical
models that would allow one to judge the computational complexity of an algorithm
when used in the context of disk-based storage, where disk access patterns and
resulting paging and network transfer times can totally dominate runtimes. In
such a context, standard computational complexity analyses have misleading
results, since it is not clear what can reasonably be regarded as a constant-time
operation. This, however, is not an ILP-specific problem, but affects the design
of data mining algorithms in general.

Abstract. The present paper focuses on some interesting classes of
process-control games, where winning essentially means successfully con- 
trolling the process. A master for one of these games is an agent who 
plays a winning-strategy. In this paper we investigate situations, in which 
even a complete model (given by a program) of a particular game does 
not provide enough information to synthesize — even in the limit — a 
winning strategy. However, if in addition to getting a program, a machine 
may also watch masters play winning strategies, then the machine is able 
to learn in the limit a winning strategy for the given game. Studied are 
successful learning from arbitrary masters and from pedagogically use- 
ful selected masters. It is shown that selected masters are strictly more 
helpful for learning than are arbitrary masters. Both for learning from ar- 
bitrary masters and for learning from selected masters, though, there are 
cases where one can learn programs for winning strategies from masters 
but not if one is required to learn a program for the master’s strategy 
itself. Both for learning from arbitrary masters and for learning from 
selected masters, one can learn strictly more watching m + 1 masters
than one can learn watching only m. Lastly a simulation result is pre- 
sented where the presence of a selected master reduces the complexity 
from infinitely many semantic mind changes to finitely many syntactic 
ones. 



1 Introduction 

To learn to win games such as chess, besides exploring the game tree with many 
practice games, it is also useful, or may even be necessary, to study the games
of master players.^1 We do not have much access to the masters’ actual strategic
programs, mostly stored in their subconscious wetware. We have, instead, access
to their game-playing behavior. It is also apparently useful to study the (game-playing)
behavior of masters who play with very different styles. For example,
it is likely better to study the behavior of both Kasparov and Deep Blue^2 than
to study only one of them.

In machine learning, the behavioral cloning approach to process-control, sur- 
veyed in involves using data from the behavior of master or expert human 
controllers, in order to make complex control learning problems feasible. For 
example, it has been used successfully to teach an autopilot to fly an aircraft 
simulator and to teach a machine to operate efficiently a free-swinging
shipyard crane HES). Behavioral cloning partly motivates the present
paper. 



For us the masters are players of winning strategies for the classes of process-control
games described in Section 1.1 just below. Of course the experts behaviorally
cloned in the machine learning experiments mentioned just above aren’t
necessarily playing exactly the same kinds of process-control games as we study 
herein, nor are they necessarily playing perfect, complete, winning strategies. 
Nonetheless, some of the parallels we describe, in the rest of this subsection, 
between these experimental machine learning results and our main theorems are 
very interesting and, we hope, instructive for the future. 



In this paper we study situations in which the learnability of strategies neces- 
sarily depends on the fact that the learner, in addition to exploring a complete 
description of the game, may also watch the behaviour of master players. For 
pedagogical purposes, some masters may be better to watch than others. In 
pi 1311 2p‘2 1 l‘22j it is noted that better results were obtained using the data from 
some pilots rather than others. Theorem 0in Section El below implies that some 
masters are strictly more helpful than others. Hence, we distinguish between 
whether we are using arbitrary or carefully selected masters.^3

In \ I l,''l 1 212 1 tZ'Zj the learning program employed, C4.5 da, did not merely 
learn to copy identically each pilot modeled. We show in Section El for both 
arbitrary (Theorem 01) and selected masters (Theorem 01), that there are cases 
where one can learn winning strategies for process-control games from masters 
but not if one is required to copycat the master. An interestingly contrasting 
theorem in the same section (Theorem ED implies that, if a class of process- 
control games can be learned incrementally, i.e., after finitely many trial and 
error rounds, from arbitrary masters, then it can be incrementally learned by 
copycatting selected masters. 



^1 In this paragraph, by master players we mean players who win, not players formally
designated as Masters (as opposed to Grand Masters in chess, ...).

^2 In principle, in the case of Deep Blue, we could look at its actual strategic program,
but even Kasparov learned from watching Deep Blue’s behavior.

^3 Formally, this distinction is handled definitionally by universal versus existential
quantifiers over masters in positive assertions (see Definition □ in Section 0 below).

In the learning-to-fly project | |1 1,311 ‘2f21 122| it was discovered that C4.5 got
confused if it received data from more than one pilot at a time. Seemingly con- 
trasting with this, in Section 0 below, we show, for both arbitrary (Theorem 0 
and selected masters (Theorem 01, for each m ≥ 1, surprisingly, one can learn
strictly more watching m + 1 masters than one can learn watching only m.^4
Interestingly, the separation between learning from two and learning from one 
selected master(s) is witnessed by a class of games, which is essentially specified 
by the natural class of all trees that contain infinitely many infinite computable 
branches. 



1.1 The Process-Control Games

In the present paper we focus on the learning of (programs for) winning strategies 
for two kinds of process-control games. The two kinds turn out to be, for all 
our purposes, mathematically equivalent!^5 The second kind is mathematically
elegantly simple, so we state and prove our results in terms of it, but, although
this second kind is interesting in its own right, more of our motivation comes
from the first kind. Again: all of our results straightforwardly carry over mutatis 
mutandis to the first kind of process-control game. 

The process-control games of the first kind are called closed computable ga- 
mes. These games nicely model reactive process-control problems. The second 
are the one-player immortality games (synonymously: branch games). We de- 
scribe each in turn, the first informally (with references) and the second in more 
detail. 

To explain closed computable games, we show how to model an archetypal 
process-control problem as a closed computable game. Suppose we wish to keep 
the temperature $t$ in a particular room between $t_{min} = 18$ °C and $t_{max} = 22$ °C,
inclusive, where the initial temperature is $t_0 = 20$ °C. A temperature controller,
which can sense the temperature in the room, and an unseen physical disturbance
each act at discrete times $n = 0, 1, 2, \ldots$ on the temperature of the room
as follows. At time $n$, the controller and the disturbance can and do choose respective
actions $a_n$ and $d_n$, each in $\{-1, 0, 1\}$, where the resultant temperature, in
degrees Celsius, at time $n+1$, is given by $t_{n+1} = f(t_n, a_n, d_n) = t_n + a_n + d_n$. The
controller sees the temperature, not the disturbance, and, from its perspective,
the temperature behaves indeterministically, yet the controller has to do well
against all possible behaviors of the temperature and disturbance. Equivalently,
the controller needs a winning strategy for the associated two-player closed computable
game we describe next. Player I is the controller and Player II is the
temperature. Of course we know that Player II is a mere puppet of Player I and
the unseen disturbance, but Player I can see the temperature, so it is better to
model Player II as the temperature. A play of the game is just an alternating infinite
sequence $a_0 t_1 a_1 t_2 \ldots$ of controller actions and temperatures, i.e., of moves
of Players I and II. Player I wins the play $a_0 t_1 a_1 t_2 \ldots$ iff $(\forall n)[t_{min} \le t_n \le t_{max}]$.

^ In the project on teaching an autopilot, a separate attribute distinguishing one pilot 
from another was not used; hence, this may explain the contrast. 
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The goal set for Player I is (by definition) the set of all plays aotiait 2 ■ ■ ■ where 
Player I wins. In topology, closed sets (by definition) contain their limit points. 
The game we have described is called closed since the goal set for Player I is a 
closed set. I.e., if every finite initial segment tj„ = a^tiai . . . of an infinite 

play yields no loss for the controller (i.e., if, for each n > 1, the temperature is 
between tmin and tmax at times m = 1, . . . ,n), then the limit point aotiai . . . 
of the SMuence (o’n)ne<^ is a win for the controller, i.e., is in the goal set for 
Player 10 An example winning strategy for Player I, the controller, is as follows: 

$$a_n = \begin{cases} +1 & \text{if } t_n < 20 \text{ or } n = 0, \\ -1 & \text{if } t_n \ge 20 \text{ and } n > 0. \end{cases} \qquad (1)$$

We have defined the winning strategy above by an informal algorithm, or program;
hence, it is clearly computable. A human master playing this strategy would have
stored in his/her head this or an equivalent program. Formally, the watchable
behavior would be an enumeration of the pairs $(t, a)$ such that $t$ is a temperature
that could be observed and $a$ is this master's response.
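The thermostat strategy can be checked mechanically on finite prefixes of plays. The following is a minimal sketch, not part of the original paper: the starting temperature 20 and the bounds 18 and 22 are assumptions chosen for illustration, since the text leaves $t_0$, $t_{\min}$ and $t_{\max}$ unspecified. It enumerates every disturbance sequence up to a fixed horizon and confirms that no finite initial segment is a loss for the controller.

```python
from itertools import product

# Hypothetical parameters; the paper does not fix t_0, t_min or t_max.
T_START, T_MIN, T_MAX = 20, 18, 22


def controller(t, n):
    """The example strategy: heat if below 20 or at time 0, otherwise cool."""
    return 1 if (t < 20 or n == 0) else -1


def play(disturbances, t0=T_START):
    """Return the temperatures t_1, t_2, ... produced against the given disturbances."""
    temps, t = [], t0
    for n, d in enumerate(disturbances):
        t = t + controller(t, n) + d  # t_{n+1} = t_n + a_n + d_n
        temps.append(t)
    return temps


def controller_survives(horizon):
    """Check every disturbance sequence of the given length (3**horizon plays)."""
    return all(
        all(T_MIN <= t <= T_MAX for t in play(ds))
        for ds in product((-1, 0, 1), repeat=horizon)
    )


if __name__ == "__main__":
    print(controller_survives(8))
```

Because the goal set is closed, surviving every finite horizon is exactly what winning the infinite play means; the sketch only samples one horizon and therefore illustrates the idea rather than proving it.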

Next we describe the mathematically equivalent one-player immortality
games. As an informal example, consider a robot which is placed in a (finite or
infinite) environment. The robot's job is to keep exploring its environment yet
not get trapped or destroyed. To help it, it has a model of its environment, from
which it can generate, for example, a map showing the dangerous spots. If we
model finite environments as deterministic finite automata [?], then, in these
cases, the one-player immortality game can be modeled as follows: Given a finite
automaton, a winning strategy is an infinite word such that the finite automaton
never visits a rejecting state when run on this word.
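For finite automata, a winning infinite word can be found by a standard fixpoint computation. The sketch below is an illustration under assumptions and is not taken from the paper: the four-state automaton, its transition table and the helper names are all hypothetical. It first computes the set of states from which the rejecting states can be avoided forever, and then reads off a prefix of a winning word by always stepping to such a state.

```python
def safe_states(states, alphabet, delta, rejecting):
    """Greatest set of non-rejecting states that each have a successor in the set."""
    safe = set(states) - set(rejecting)
    changed = True
    while changed:
        changed = False
        for q in list(safe):
            if not any(delta[(q, a)] in safe for a in alphabet):
                safe.discard(q)  # q is a dead end: every move leaves the safe set
                changed = True
    return safe


def winning_prefix(start, alphabet, delta, safe, length):
    """First `length` letters of a word that never leaves the safe set, or None."""
    if start not in safe:
        return None
    word, q = [], start
    for _ in range(length):
        a = next(a for a in alphabet if delta[(q, a)] in safe)
        word.append(a)
        q = delta[(q, a)]
    return "".join(word)


# Hypothetical environment: state 3 is a rejecting sink, state 2 is a trap.
STATES = [0, 1, 2, 3]
ALPHABET = "ab"
REJECTING = {3}
DELTA = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 3, (1, "b"): 0,
    (2, "a"): 3, (2, "b"): 3,
    (3, "a"): 3, (3, "b"): 3,
}

if __name__ == "__main__":
    safe = safe_states(STATES, ALPHABET, DELTA, REJECTING)
    print(safe, winning_prefix(0, ALPHABET, DELTA, safe, 6))
```

State 2 is not rejecting itself, yet it is removed from the safe set because every move from it leads to the sink; this is the "entrapment" the robot example refers to.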

Formally, and in general, a one player immortality game is (by definition) 
a computable tree containing at least one infinite (computable) branch. The 
player starts at the root, and its moves must take it successively further from the 
root. The winning strategies are exactly the infinite branches of the trees. The 
conventional, master-free, strategy learning scenario is: given an enumeration 
of the graph, or even a program, of the game tree, incrementally synthesize a 
program for following some such winning strategy, i.e., for traveling along some 
infinite branch. Death or entrapment is modeled, then, by the player getting 
stuck on a finite branch. 



1.2 The Power of Watching Masters 



As shown in [3], there are classes $C$ of immortality games such that no machine
can synthesize a winning strategy for every game $G \in C$ in the limit, given an
enumeration of the graph, or even a program, of $G$ as input. However, it is
reasonable that one can overcome such limitations by presenting to the machine
an enumeration of a winning strategy as additional input, that is, the learner may
watch a master. In this work we study the power of (several variants of) this new
learning notion. It is important to note that for all our results in Sections 2 and 3,
where we compare different models of learning from masters, it does not matter
whether the game tree is presented by a program or by an enumeration of its
graph. On the other hand, this matters when one compares master learning to
conventional strategy learning. This comparison is the topic of Section 4, and
there we will also discuss the effect of the two different input models on this
comparison.

For more formal treatment, see [?].

In order to demonstrate the additional power which a learner may gain from
watching a master, we consider the following illustrative example [?]. Let the
tree $T_e$ consist of branches $f_k = k a_0 a_1 \ldots$ for every natural number $k \in \omega$, that is,
$T_e$ has an infinitely branching root such that every successor $k$ of the root is only
extended by the branch $f_k$. The branches $f_k$ may be finite or infinite, depending
on the $e$-th recursively enumerable set $W_e$ (of a standard enumeration of all
r.e. sets). More precisely, we let $0 a_0 \ldots a_n$ be "on the tree" $T_e$ iff $a_m$ is the smallest
number such that $|W_{e,a_m}| \ge m$ for $m = 0, \ldots, n$. Here, $(W_{e,n})_{n \in \omega}$ denotes a
finite approximation of $W_e$. For $k > 0$, $k 0^n$ is on the tree iff $|W_{e,n}| < k$. It is
not difficult to see that the branch $f_0$ is infinite iff $W_e$ is infinite, and that, for
$k > 0$, $f_k$ is infinite iff $|W_e| < k$. Thus, every tree $T_e$ has an infinite branch.
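Membership in this example tree is decidable from the stagewise approximation alone, which is what makes the tree computable. The sketch below is a hedged illustration: the function `approx` is a hypothetical stand-in for the finite approximation of the r.e. set (it must return a finite set and grow monotonically with the stage), and the threshold "at least $m$ elements" follows the reading of the definition adopted here.

```python
def on_tree(node, approx):
    """Decide whether a finite sequence (a tuple of naturals) is a node of the example tree.

    approx(n) is an assumed stand-in for the stage-n approximation of the
    r.e. set; it must be a finite set that only grows as n increases.
    """
    if not node:
        return True  # the root (empty sequence)
    k, rest = node[0], node[1:]
    if k == 0:
        # Branch f_0: the m-th entry must be the least stage holding >= m elements.
        for m, a in enumerate(rest):
            reached = len(approx(a)) >= m
            reached_earlier = a > 0 and len(approx(a - 1)) >= m
            if not reached or reached_earlier:
                return False
        return True
    # Branch f_k for k > 0: k followed by n zeros is a node iff stage n has < k elements.
    return all(x == 0 for x in rest) and len(approx(len(rest))) < k


def finite_set_approx(n):
    """Toy approximation of the finite set {0, 1, 2}, one new element per stage."""
    return set(range(min(n, 3)))


if __name__ == "__main__":
    print(on_tree((0, 0, 1, 2, 3), finite_set_approx))
    print(on_tree((4, 0, 0, 0, 0, 0), finite_set_approx))
```

With the toy set of size three, the branch starting with 0 dies after four entries, the branch starting with 3 dies once the third element has appeared, and every branch starting with $k \ge 4$ is infinite, matching the case distinction in the text.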

Assume now that $M$ is a machine which computes, for all $e$, from a program
of $T_e$ a sequence $i_0 i_1 \ldots i_m i_{m+1} \ldots$ which stabilizes on a program $i$ for an infinite
branch $f = f_k$ of $T_e$. Then it holds that $W_e$ is infinite iff $k = 0$. Note that
$k = f_k(0) = f(0)$ can be computed using the program $i$. Thus, we have a
procedure which decides the index set $\mathrm{Inf} = \{e : W_e \text{ infinite}\}$ in the limit, that
is, Inf is $\Delta_2$ with respect to the arithmetical hierarchy. However, as is well known,
Inf is $\Pi_2$-complete, which is a contradiction.

Thus, a conventional branch learning machine cannot synthesize infinite
branches for all $T_e$ even if it gets a program of $T_e$ as input (instead of just an
enumeration of $T_e$).

Now, consider a learner who gets a program $j$ for $T_e$ and who watches an
enumeration $f(0)f(1)\ldots$ of an infinite branch of $T_e$. Clearly, having seen $f(0)$,
the learner knows that the branch extending $k = f(0)$ is infinite and can then,
using the program $j$ of $T_e$, compute a program for the infinite branch $f$. This
demonstrates that learning by watching masters may be far more powerful
than conventional branch learning. It even allows the learner to find a
branch without any mind changes. Moreover, the learner can even identify the
input master, instead of just learning any infinite branch of $T_e$. In this work we
analyze this additional power gained by watching masters, and compare several
interesting variants of this learning notion.
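The zero-mind-change learner for the example tree can be sketched in a few lines. This is an illustration under assumptions, not the paper's formal construction: a "program" is modelled as a Python generator function, and `approx` is a hypothetical stand-in for the stagewise approximation of the r.e. set. The learner reads only the first value of the master and then commits to the unique branch through that child of the root.

```python
from itertools import islice


def branch_program(k, approx):
    """Return a generator function enumerating the unique branch through child k."""
    def branch():
        yield k
        if k > 0:
            while True:  # the branch k 0 0 0 ...
                yield 0
        m = 0
        while True:
            a = 0
            while len(approx(a)) < m:  # loops forever if the set has fewer than m elements
                a += 1
            yield a
            m += 1
    return branch


def learn_from_master(master_prefix, approx):
    """Conjecture after seeing a finite prefix of the master; None plays the role of '?'."""
    if not master_prefix:
        return None
    return branch_program(master_prefix[0], approx)


def infinite_set_approx(n):
    """Toy approximation of an infinite set: stage n contains exactly n elements."""
    return set(range(n))


if __name__ == "__main__":
    program = learn_from_master([0, 0, 1], infinite_set_approx)
    print(list(islice(program(), 5)))
```

The learner never inspects more than the first observed value, so its conjecture cannot change once made. Its correctness rests entirely on the promise that the watched sequence is an infinite branch of the tree.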

Note that we could easily code $e$ into the beginning of the tree $T_e$. Then
it is still impossible to synthesize strategies for this class in the limit; however,
a learner who watches a master can still be successful on this class, even if
the learner gets only an enumeration of the graph of $T_e$ as input. For similar
reasons, all our results in Sections 2 and 3 are, as already noted, independent of
the input model for the trees. Therefore, we will base our more formal
treatment, which now follows, on the easier model, in which the learner gets a
program for the game tree as input.
2 Learning from a Master

The natural numbers are denoted by $\omega$. $\omega^*$ is the set of all finite sequences
from $\omega$, and $\omega^\omega$ is the set of all infinite sequences from $\omega$. We are using an
acceptable programming system $\varphi_0, \varphi_1, \ldots$; the function computed by the $e$-th
program within $s$ steps is denoted by $\varphi_{e,s}$. REC is the set of all total computable
functions. For strings $\sigma, \tau \in \omega^* \cup \omega^\omega$, $\sigma \preceq \tau$ means that $\sigma$ is an initial segment
of $\tau$. $|a_1 \ldots a_n| = n$ denotes the length of a string $a_1 \ldots a_n \in \omega^*$. If $\sigma \in \omega^*$ and
$\tau \in \omega^* \cup \omega^\omega$ then $\sigma\tau$ is the concatenation of the two strings. Let $\langle\cdot\rangle$ be a coding
of $\omega^*$, i.e., a bijective computable function $\langle\cdot\rangle : \omega^* \to \omega$, which is monotone with
respect to subsequences:

$$(\forall \sigma, \tau \in \omega^*)[\sigma \text{ is a subsequence of } \tau \Rightarrow \langle\sigma\rangle \le \langle\tau\rangle].$$

We identify finite strings with their code numbers. Total functions $f : \omega \to \omega$
are identified with the infinite string $f(0)f(1)\ldots$. We write $f[n]$ for the initial
segment $f(0) \ldots f(n-1)$ of $f$.

$T \subseteq \omega^*$ is a tree if $T$ is closed under initial segments. Elements of a tree are
called nodes. If $A \subseteq \omega^* \cup \omega^\omega$ is a set of finite and infinite strings, then the prefix
closure, $\{\sigma \in \omega^* : \sigma \preceq \alpha \text{ for some } \alpha \in A\}$, is a tree. We often will define trees
by specifying only such a set $A$. A total function $f : \omega \to \omega$ is an infinite branch
of $T$ if $f[n] \in T$ for all $n \in \omega$.
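These notions are easy to experiment with on finite data. The following sketch is an illustration only; it represents strings as Python tuples and handles finite sets of finite strings, which is an assumption made for the sake of a runnable example (the paper also allows infinite strings in the generating set).

```python
def prefix_closure(strings):
    """All initial segments of the given finite strings (tuples)."""
    return {s[:i] for s in strings for i in range(len(s) + 1)}


def is_tree(nodes):
    """A set of finite strings is a tree iff it is closed under initial segments."""
    return all(s[:i] in nodes for s in nodes for i in range(len(s)))


def initial_segment(f, n):
    """The string f(0) ... f(n-1), i.e. f[n] in the notation of the text."""
    return tuple(f(x) for x in range(n))


if __name__ == "__main__":
    closure = prefix_closure({(0, 1), (2,)})
    print(sorted(closure), is_tree(closure))
```

A function is an infinite branch of a tree exactly when `initial_segment(f, n)` is a node for every `n`; on a finite sample this can only ever be checked up to some bound.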

For background on inductive inference see, e.g., [?]. Remaining
computability-theoretic notation is from [?].

We are interested only in the class Tree of all computable trees which contain
at least one infinite computable branch. If $f \in$ REC is an infinite computable
branch of $T$ we also say that $f$ is on $T$. Moreover, in the context where an $f$ on
$T$ is given as input to a learner, such a branch is called a master.

In what follows, for convenience, we will say branch to refer only to infi- 
nite computable branches. Furthermore, also for convenience, we will sometimes 
speak of learning a branch when we mean learning a program for the branch. 

Definition 1. A Turing machine $M$ learns a branch from an arbitrary master
for a tree $T \in$ Tree, if for all masters $f$ on $T$ and for all $e$ with $\varphi_e = T$, the
sequence $(M(e, f[n]))_{n \in \omega}$ converges to an infinite computable branch of $T$, i.e.,
there exists an $i$ such that $\varphi_i$ is an infinite branch on $T$ and $M(e, f[n]) = i$
for almost all $n$. For $C \subseteq$ Tree we write $C \in$ ArbMa if there exists a Turing
machine $M$ which learns branches from an arbitrary master for every $T \in C$.

® The theory remains the same if it is based on trees over a finite alphabet, e.g., on
binary trees $T \subseteq \{0, 1\}^*$ [?].

^ By a well known argument, we can assume, without loss of generality, that all
learning machines considered in this paper are total [?].

® Note that the sequence $M(e, f[n])$ converges syntactically to a (program for an)
infinite computable branch, i.e., our notion of learning corresponds to the version
of learning in the limit which is called Ex-style (or incremental) learning in the
literature.
A Turing machine $M$ learns a branch from a selected master for a tree
$T \in$ Tree, if there exists a master $f$ on $T$ such that for all $e$ with $\varphi_e = T$ the
sequence $(M(e, f[n]))_{n \in \omega}$ converges to a (program for an) infinite computable
branch of $T$. For $C \subseteq$ Tree we write $C \in$ SelectMa if there exists a Turing
machine $M$ which learns branches from a selected master for every $T \in C$.

If there exists an ArbMa- or SelectMa-learner which converges to a
program for the input master (instead of just to a program for any infinite
computable branch of $T$) for every tree $T$ of a class $C$, we say that $C$ is learnable from
arbitrary/selected masters identically. The corresponding classes are denoted by
ArbMaId and SelectMaId.

The definitions directly imply ArbMa $\subseteq$ SelectMa, ArbMaId $\subseteq$ ArbMa
and SelectMaId $\subseteq$ SelectMa. One can prove that these inclusions are proper.
Thus, to identify a master is a proper restriction for both learning from arbitrary
and learning from selected masters. This shows that the advantage in watching
one master (rather than none) comes from one's creating one's own winning
strategy, and not from being a copycat. This result is not as surprising for ArbMa
since one can imagine masters who go out of their way to avoid being figured
out. But for the selective version of master learning this result is much more
interesting. It says that regardless of how pedagogically skilled the selected
master you are watching is, if one can learn a winning strategy from him/her/it,
then this is, in general, only possible by creating a new strategy which differs
from that of the master.

The noninclusion SelectMa $\not\subseteq$ ArbMa shows that not all masters are
equally helpful for a learner. We are even able to prove SelectMaId $\not\subseteq$ ArbMa.
Thus, while watching some masters provides enough information to identify these
masters, watching others may be absolutely useless. Surprisingly, the other
direction of the inclusion, ArbMa $\subseteq$ SelectMaId, holds. I.e., if every master
allows the learner to at least find some winning strategy, then there exists one
master which can even be identified by the learner. In summary, this establishes
the proper linear chain ArbMaId $\subset$ ArbMa $\subset$ SelectMaId $\subset$ SelectMa.
Moreover, it holds that Tree $\notin$ SelectMa. I.e., even if a learner can watch "a
most helpful" master and is only required to output any winning strategy, it is
still not possible to learn such strategies for all games which have one.

One complexity measure of a learning task is the number of mind changes
which a machine needs to stabilize on a program for the target object. With 
respect to this complexity measure, learning without any mind changes at all 
provides the strongest positive results one may obtain. Zero mind change learning 
is also called finite learning in the literature. All the separation results which we 
give in this section are established by classes of trees so that the positive half of 
the separation result is witnessed by a machine which makes no mind changes! 
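Counting mind changes in a finite run of conjectures is straightforward. The sketch below is an illustration and assumes the usual convention that a special symbol "?" may be output before the first real conjecture, and that replacing "?" by a conjecture is not counted as a mind change.

```python
def count_mind_changes(hypotheses, undecided="?"):
    """Number of positions where a real conjecture is replaced by a different output."""
    changes = 0
    previous = undecided
    for h in hypotheses:
        if previous != undecided and h != previous:
            changes += 1
        previous = h
    return changes


if __name__ == "__main__":
    print(count_mind_changes(["?", "?", 5, 5, 5]))     # finite learning: no mind change
    print(count_mind_changes(["?", 3, 3, 7, 7, 2]))    # two mind changes
```

A learner is finite (zero mind change) in this sense when every run it produces on a successful input gives a count of zero.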



® In the formal definition of mind change one allows the machine to output initially a
special symbol "?" to indicate that it has not yet seen enough data to make up its
mind for its first conjecture. So a mind change is said to happen if
$? \ne M(e, f[n]) \ne M(e, f[n+1])$.
Theorem 2. ArbMa $\subseteq$ SelectMaId.

Proof Sketch. This simulation proof is related to the proof of Freivalds and
Wiehagen that every computable function can be learned from an upper bound of
any of its indices [?].

Let $M$ be an ArbMa-learner. Furthermore, for any given tree select that
infinite branch which has among all computable infinite branches the smallest
minimal index, say $e$. Now this branch $f$ can be inferred as follows:

The new machine $N$ emulates $M$ on $f[n]$ and receives a sequence $(e_n)_{n \in \omega}$ of
hypotheses which converges to an index $e'$ of some branch of the tree. By choice,
$e'$ is greater than or equal to $e$. At each stage, $N$ amalgamates all programs $i$ below
$e_n$ which are consistent with the input master $f$ on $\{0, \ldots, n\}$ during their first
$n$ computation steps. This algorithm amalgamates in the limit all programs with
indices below $e'$ which are consistent with $f$ and thus identifies the master $f$
in the limit. Hence $N$ witnesses that the class of trees ArbMa-learned by $M$
can also be SelectMaId-learned where the selected master is the one with the
smallest index. □

Theorem 3. ArbMa $\not\subseteq$ ArbMaId. Moreover, this noninclusion can be
witnessed by a class of trees which is ArbMa-learnable without any mind change.

Theorem 4. SelectMa $\not\subseteq$ SelectMaId. Moreover, this noninclusion can be
witnessed by a class of trees which is SelectMa-learnable without any mind
change.

Proof Sketch. We will build trees $T_e$ for all $e \in \omega$. Each tree $T_e$ has the form
$T_e = \bigcup_{i \in \omega} T_e^i$ such that $ei \preceq \sigma$ for all $\sigma \in T_e^i$, $|\sigma| \ge 2$, and each tree $T_e^i$
contains at most one infinite computable branch $f$. In every tree $T_e^i$ we will
try to diagonalize against $\varphi_e$ as a potential SelectMaId-learner for $T_e$. We will
diagonalize against $\varphi_e$ by securing that $\varphi_e$ makes enough "prediction errors" on
$f$, that is, $\varphi_e(f[n])$ is undefined or does not equal $f(n)$ for infinitely many $n$. Since
every SelectMaId-learner $M$ would yield a predictor $\lambda\sigma \in \omega^*.\,\varphi_{M(j,\sigma)}(|\sigma|)$ for $f$
on $\varphi_j = T_e^i$, which is at most finitely often undefined or incorrect, this will suffice
to diagonalize against all SelectMaId-learners. $T_e$ will be uniformly computable
from $e$. Since $e$ is encoded into the master, it is therefore not necessary to give
an index of $T_e$ as input to the diagonalized machine.

Furthermore, we organize this diagonalization in such a way that it will only
succeed in $T_e^0$ and in all trees $T_e^i$, for $i > 0$, such that $\varphi_i$ is an infinite branch
of $T_e$. All other trees $T_e^i$ will be finite. This implies that for $i > 0$ every tree $T_e^i$
contains an infinite branch iff $\varphi_i$ is an infinite branch of $T_e$. This idea is the key
to achieve $C \in$ SelectMa, since the branch $f$ of such an infinite tree $T_e^i$ with
$i > 0$ fulfills $f(1) = i$, that is, $f(1)$ is a program for an infinite branch of $T_e$.
Therefore, these branches can be used as selected masters.

Construction of $T_e = \{\sigma_s^i : i, s \in \omega\}$:

Stage 0:

For all $i \in \omega$: $\sigma_0^i = \tau_0^i = ei$, $x_0^i = 2$.

Stage $s+1$:

1. For all $i > 0$:

If $(\forall x < x_s^i)[\varphi_{i,s}(x)\downarrow]$ and $(\exists t \le \ldots)[\ldots \preceq \sigma \ldots]$ then let
$x_{s+1}^i = x_s^i + 1$.
Else let
$x_{s+1}^i = x_s^i$.

{ Check whether $\varphi_i$ seems to become a branch of $T_e$, so that $T_e^i$ may
be extended in step 2. }

2. For all $i$:

If $i = 0$ or $x_{s+1}^i > x_s^i$ then
if $(\exists \eta)[\tau_s^i \preceq \eta \preceq \sigma_s^i$ and $\varphi_{e,s}(\eta)\downarrow]$, then choose the smallest
such $\eta$ and let
$a = s + 1 + \varphi_e(\eta)$,
$\sigma_{s+1}^i = \tau_{s+1}^i = \eta a$.
Else let
$a = s + 1$,
$\sigma_{s+1}^i = \sigma_s^i a$, $\tau_{s+1}^i = \tau_s^i$.
Else let
$\sigma_{s+1}^i = \sigma_s^i$, $\tau_{s+1}^i = \tau_s^i$.

End of Construction.

□ 



Theorem 5. Tree $\notin$ SelectMa.

Theorem 6. SelectMaId $\not\subseteq$ ArbMa. Moreover, this noninclusion can be
witnessed by a class of trees which is SelectMaId-learnable without any mind
change.

3 Hierarchies for Learning from Many Masters 

Definition 7. In the following we write $icb(T)$ for the number of infinite
computable branches of a tree $T \in$ Tree. Note that $T$ may have infinitely many
infinite computable branches, in which case we write $icb(T) = \infty$.

A Turing machine $M$ learns a branch from arbitrary $m$ masters for a tree
$T \in$ Tree, if for all masters $f_1, \ldots, f_m$ on $T$ with

$$|\{f_1, \ldots, f_m\}| \ge \min\{m, icb(T)\} \qquad (2)$$

and for all $e$ with $\varphi_e = T$ the sequence $(M(e, f_1[n], \ldots, f_m[n]))_{n \in \omega}$ converges to
a (program for an) infinite computable branch of $T$. $M$ learns a branch from
selected $m$ masters for a tree $T \in$ Tree, if there exist masters $f_1, \ldots, f_m$ on $T$ such
that for all $e$ with $\varphi_e = T$ the sequence $(M(e, f_1[n], \ldots, f_m[n]))_{n \in \omega}$ converges to
a (program for an) infinite computable branch of $T$.

The corresponding classes are denoted by ArbMa$^m$ and SelectMa$^m$.

Requirement (2) just above on the input masters of an ArbMa$^m$-learner ensures
that:
1. If there are at least $m$ distinct infinite branches on $T$, then the $m$ masters
are pairwise distinct;

2. if the number $k$ of distinct infinite branches on $T$ is $< m$, then exactly $k$ of
the $m$ masters are pairwise distinct; and,

3. hence, both ArbMa$^m$ $\subseteq$ ArbMa$^{m+1}$ and SelectMa$^m$ $\subseteq$ SelectMa$^{m+1}$.

Theorem 8. For all $m \ge 1$: ArbMa$^m$ $\subset$ ArbMa$^{m+1}$. Moreover, the
noninclusion can be witnessed by a class of trees which is ArbMa$^{m+1}$-learnable without
any mind change.

The separation of ArbMa$^{m+1}$ and ArbMa$^m$ can also be witnessed by the
natural class $C_{m+1}$ of all computable binary trees from Tree which have at most $m+1$
(arbitrary) infinite branches. However, the ArbMa$^{m+1}$-learner for $C_{m+1}$ needs
generally $\log(m)$ mind changes (instead of zero mind changes).

The analogous hierarchy result to Theorem 8 also holds for branch learning
from selected masters:

Theorem 9. For all $m \ge 1$: SelectMa$^m$ $\subset$ SelectMa$^{m+1}$.

This result, in particular, implies that even our most powerful notion of master 
learning, namely, learning an arbitrary branch from several selected masters, is 
not strong enough to learn a branch for every tree: 

Corollary 10. For all $m \ge 1$: Tree $\notin$ SelectMa$^m$.

Theorem 9 can be proven using a team learning result from [?]. However,
interestingly, the separation of SelectMa$^1$ and SelectMa$^2$ is witnessed by natural
classes of trees, in particular, by the class Tree$_{\mathrm{inf}}$ of all trees which contain
infinitely many infinite computable branches. Since such separation results, which
are witnessed by natural classes, are particularly interesting for inductive
inference, we present here the proof of this result instead of proving the general
statement from Theorem 9.

Besides Tree$_{\mathrm{inf}}$, also the following natural classes witness SelectMa$^2$ $\not\subseteq$
SelectMa$^1$ (see, e.g., [?] for definitions):

— the binary trees which contain only computable infinite branches, 

— the binary trees of bounded width, 

— the binary trees of bounded rank, and 

— the binary trees of bounded variation. 

Theorem 11. The class Tree$_{\mathrm{inf}}$ of all trees which contain infinitely many
infinite computable branches is in SelectMa$^2$ $-$ SelectMa$^1$.

Proof. We first prove Tree$_{\mathrm{inf}}$ $\in$ SelectMa$^2$. So, let an arbitrary tree $T \in$
Tree$_{\mathrm{inf}}$ be given, and let $e'$ be the smallest $e$ such that $\varphi_e$ is on $T$. Then
there exists an $x'$ such that for all $e < e'$, either $\varphi_e(x)\uparrow$ for some $x < x'$, or
$\varphi_e[x'] \notin T$. Since $T$ contains infinitely many branches, $T$ contains an infinitely
branching node, or, for every $n$, there exist two different branches $f_1, f_2$ on $T$
with $f_1[n] = f_2[n]$. This implies that there exist an $n'$ and two different
branches $f_1, f_2$ on $T$ with $f_1[n'] = f_2[n']$, $f_1(n') \ne f_2(n')$ and $n' + f_1(n') + f_2(n') > x'$.
We choose these branches $f_1$ and $f_2$ as selected masters.

Now, the machine $M$ witnessing Tree$_{\mathrm{inf}}$ $\in$ SelectMa$^2$ works as follows. On
input $T, f_1, f_2$ the machine $M$ waits until it finds the first $n$ with $f_1(n) \ne f_2(n)$.
Then it computes $y = n + f_1(n) + f_2(n)$. Note that $n = n'$ and $y > x'$. From
now on, $M$ outputs in stage $n$ the smallest $e < n$ such that

$$(\forall x < y)[\varphi_{e,n}(x)\downarrow] \text{ and } \varphi_e[y] \in T. \qquad (3)$$

If no such $e$ exists, then $M$ outputs 0.

By definition, there exists an $n > e'$ such that $e = e'$ and $n$ satisfy (3). On
the other hand, no $e < e'$ can satisfy (3) for any $n$. Thus, from some point on,
the machine will always output $e'$, which is a program for a branch of $T$. Hence,
$M$ is a correct SelectMa$^2$-learner for the class Tree$_{\mathrm{inf}}$.
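The two-master learner from the positive half of the proof can be sketched on a toy programming system. Everything concrete below is an assumption made for illustration: "programs" are Python callables `p(x, steps)` that return a value if they halt within `steps` steps and `None` otherwise, the tree is the full binary tree, and the four listed programs and two masters are hypothetical. The point where the masters split yields the bound $y$, and the learner then searches for the least program that is defined below $y$ and stays on the tree.

```python
def learner_stage(n, in_tree, programs, f1, f2):
    """Conjecture at stage n; None plays the role of '?' before the masters split."""
    split = next((k for k in range(n) if f1(k) != f2(k)), None)
    if split is None:
        return None
    y = split + f1(split) + f2(split)
    for e in range(min(n, len(programs))):  # candidates e < n
        values = [programs[e](x, n) for x in range(y)]
        if None not in values and in_tree(tuple(values)):
            return e
    return 0


def diverge(x, steps):
    """A program that never halts."""
    return None


def const(c):
    """A program for the constant function c that needs more than x steps on input x."""
    def run(x, steps):
        return c if steps > x else None
    return run


# Hypothetical programming system: indices 0 and 1 are not branches, index 2 is.
PROGRAMS = [diverge, const(2), const(0), const(1)]


def in_full_binary_tree(node):
    return all(b in (0, 1) for b in node)


def master1(k):
    return 0


def master2(k):
    return 1 if k == 3 else 0


if __name__ == "__main__":
    print([learner_stage(n, in_full_binary_tree, PROGRAMS, master1, master2)
           for n in range(8)])
```

In this toy run the masters split at position 3, giving $y = 4$, which is large enough to refute programs 0 and 1, so the learner settles on index 2. The selected masters in the proof are chosen precisely so that their split point encodes a sufficiently large bound.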

We only sketch the proof of the negative statement. Assume by way of
contradiction that the Turing machine $M$ witnesses Tree$_{\mathrm{inf}}$ $\in$ SelectMa. Let $\eta_n$
denote the unique string with $n = \langle\eta_n\rangle$. We will construct a tree $T$ such that $M$
fails to SelectMa-learn a branch for $T$ as follows:

Construction of $T = \bigcup_{s \in \omega} T_s$:

Stage 0: $T_0 = \emptyset$, $queue_0 = (\varepsilon)$.

Stage $s+1$: Assume that $queue_s = (\sigma_0, \ldots, \sigma_q)$.

Compute $j = M(e, \sigma_0)$, where $\varphi_e = T$. {Recursion Theorem.}

Check whether $\eta_s$ satisfies the following conditions:

(a) $\sigma_0 \preceq \eta_s$,

(b) $(\forall \tau, \sigma_0 \preceq \tau \prec \eta_s)[\tau \in T_s]$,

(c) $(\exists x < |\sigma_0|)[\varphi_{j,s}(x)\uparrow]$ or $(\exists x < |\eta_s|)[\varphi_{j,s}(x)\downarrow \ne \eta_s(x)]$.

If $\eta_s$ satisfies (a) - (c) then
let $T_{s+1} = T_s \cup \{\eta_s\}$;

if $M(e, \eta_s) \ne j$ then let $queue_{s+1} = (\sigma_1, \ldots, \sigma_q, \eta_s 1)$, (4)

else let $queue_{s+1} = queue_s$.

If $\eta_s$ does not satisfy (a) - (c) then let $T_{s+1} = T_s$, $queue_{s+1} = queue_s$.

End of Construction.

One can show that $T$ is actually in Tree$_{\mathrm{inf}}$. In order to prove that $M$ does not
SelectMa-learn a branch of $T$ one distinguishes the following two cases.

Case 1: There exists a stage $s$ with $queue_s = (\sigma_0, \ldots, \sigma_q)$ such that $\sigma_0$ is
never removed from the queue in any stage $t > s$. This implies that, for all
masters $f$ on $T$, $f$ extends $\sigma_0$, and, furthermore, $M(e, f[n]) = M(e, \sigma_0)$ for
all $n \ge |\sigma_0|$. Thus, $M$ converges on all masters to $j = M(e, \sigma_0)$. Then, $\varphi_j$ is
total. However, in this case, from some point on, we will only extend branches
which are inconsistent with $\varphi_j$ due to condition (c). Hence, $\varphi_j$ is no branch of
$T$. Contradiction.

Case 2: Every string is eventually removed from $queue_s$. Then, by
construction, $M$ will make infinitely many mind changes on every master of $T$ due to
line (4). Thus, for every master $f$ of $T$, $M$ fails to converge to a single program
on input $f$. Contradiction. □

4 Master Learning Versus Branch Learning 

How is ordinary branch learning [?], where the learner can only inspect the
tree but has no access to a master, related to master learning? As already noted in
the introduction, this comparison is, in contrast to the results of Sections 2 and 3,
affected by the input model for the tree. In order to make the differences clear,
we denote, for Crit $\in$ {ArbMaId, ArbMa, SelectMaId, SelectMa}, the
version of Crit, where $M(e, f[n])$ in Definition 1 is replaced with $M(T[n], f[n])$,
by Enum-Crit.

It is well known that, in general, enumerations of graphs are far less
useful than programs. Therefore, it is not so surprising that ArbMaId $\not\subseteq$
Enum-SelectMa. In other words, if one is working with an enumeration of
the tree, then even the most powerful master learning notion fails to capture the
most restrictive master learning notion working with programs of the trees. This
result, in particular, implies that Enum-Crit $\subset$ Crit for all master learning
criteria Crit.

The most powerful branch learning notion from [?] is called weak
Bc-learning. A class $C \subseteq$ Tree is in BranchWBc, if there exists a Turing machine
$M$ such that for all $T \in C$ and for almost all $n$, the function $\varphi_{M(T[n])}$ is an infinite
computable branch of $T$, where the tree $T \subseteq \omega^*$ is identified with its
characteristic function. Note that a BranchWBc-learner is allowed to make infinitely
many semantic mind changes.

A machine synthesizes branches for a class $C \subseteq$ Tree in the limit if it computes
from every index of a tree $T \in C$ a sequence of programs, which converges
(syntactically) to a (program for an) infinite branch of $T$. The corresponding
class is denoted by SynthLim.

By using the trees from Section 1.2 one can show that no "pure branch
learning class" captures any "master learning classes", demonstrating the extreme
power of learning from masters:

Theorem 12. Enum-ArbMaId $\not\subseteq$ (BranchWBc $\cup$ SynthLim).

What can we say about the other direction, that is, which master learning classes
capture which branch learning classes? First, one can show that the master
learning notions which are working with enumerations of trees are too restrictive:

Theorem 13.

BranchWBc $\not\subseteq$ Enum-SelectMa and SynthLim $\not\subseteq$ Enum-SelectMa.

The criterion is not very restrictive (weak), but, then, many things can be learned
with respect to it, so, from that perspective, it is powerful.

SynthLim is equivalent to the learning criterion called Uni[A] in [?].
However, by definition, SynthLim is a subset of ArbMa. One can show that this
inclusion is proper. Hence, we have SynthLim $\subset$ ArbMa $\subset$ SelectMaId $\subset$
SelectMa. The classes SynthLim and BranchWBc are incomparable [?], that
is, there are classes in BranchWBc for which one cannot synthesize branches in
the limit. Thus, it is interesting to see whether one of the three master learning
notions extending SynthLim is powerful enough to capture BranchWBc.
It turns out that while ArbMa is still too restrictive to cover BranchWBc,
the power of a SelectMaId-learner suffices. Thus, if a learner gets the index
of the tree and can watch the right master, then the learner can improve its
weak performance of infinitely many semantic mind changes to the very strong
syntactic convergence to the master in the limit. The next and last theorem
of the paper formalizes this positive mind change complexity reduction result,
which has an interesting simulation proof:

Theorem 14. BranchWBc $\subseteq$ SelectMaId.

Proof. Let $C$ be in BranchWBc. Given an index of a tree $T \in C$, the set $E$ of
all programs output by the learner is uniformly enumerable and depends only on
the graph of $T$ and not on the index of $T$. The indices in $E$ define the following
subtree $T' \subseteq T$:

$$\sigma \in T' \Leftrightarrow \sigma \in T \wedge (\exists e \in E)(\forall x \in \mathrm{dom}(\sigma))[\varphi_e(x)\downarrow = \sigma(x)].$$

$T'$ is enumerable but may not be computable. Furthermore, for all indices of
$T$, the strings in $T'$ are enumerated in the same order. In addition, $T'$ has only
finitely many finite branches which cannot be extended to infinite branches,
that is, almost all nodes of $T'$ lie on an infinite branch of $T'$. So, knowing the
enumeration of $T'$, it is possible to extend almost every node $\sigma$ of $T'$ to an infinite
branch of $T'$. This branch $u_\sigma$ is defined inductively after initializing it with $\sigma$.
Let now $n$ be the first value not yet defined, then let

$$u_\sigma(n) = \begin{cases} a & \text{if } a \text{ is the first number such that } u_\sigma(0)u_\sigma(1)\ldots u_\sigma(n-1)a \text{ is enumerated into } T'; \\ \uparrow & \text{otherwise, if there is no such } a. \end{cases}$$

Every function $u_\sigma$ is either an infinite branch of $T'$ or is a finite branch which
cannot be extended and differs from all infinite branches of $T'$ at some value.
So, the following two propositions hold:

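$$\tau \preceq \sigma \text{ and } u_\tau[n] \preceq u_\sigma \Rightarrow u_\tau(n)\downarrow \qquad (5)$$

$$\sigma \preceq \tau \preceq u_\sigma \Rightarrow u_\sigma = u_\tau \qquad (6)$$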

Since the BranchWBc-learner produces at least one infinite branch, $T'$ has at
least one infinite branch. Moreover, if $\sigma$ is a node of $T'$ which is longer than every
finite branch, then the function $u_\sigma$ is infinite. Taking the master $f = u_\sigma$, the
following algorithm learns this branch. The learner starts with $u_\lambda$, where $\lambda$ is
the empty string. Whenever for the current $\tau$ an argument $n$ is found with
$u_\tau(n)\downarrow \ne f(n)$, then $\tau$ is extended to
$f(0)f(1) \ldots f(n)$ and a mind change to this new $u_\tau$ is made. After finitely many
mind changes, $u_\tau = u_\sigma$ and the SelectMaId-learner succeeds. □
5 Future Work

As already mentioned in the introduction, the experts which have been
observed in the behavioral cloning experiments are not necessarily playing winning
strategies. Also, master chess players do not win every play: instead of being
classified into winners and losers, real players are ranked, more or less continuously,
from bad to very good players. In the present basic study of watching masters
we abstracted from this fact. But in future work it would be interesting to study
learning from imperfect masters. In the first instance, the crucial question for this
research is how to model imperfect masters. For example, one may consider
masters which are playing finite variants of winning strategies. Or one may
assume that one of $m$ input masters, or a majority of them, knows the best move
in each situation. Moreover, for imperfect masters the problem of on-line
learning is no longer trivial [?]. It would be interesting to investigate probabilistic
learning from imperfect masters. The performance of an on-line learner can be
measured by the number of lost plays until it is eventually playing perfectly. Is
there a connection between the quality of the input masters and the number of
plays which an on-line learner loses?

The one-player immortality games given by a deterministic finite automaton,
as described in Section 1, are just a special case of the well known two-player
finite-state games [?]. In such games there always exist winning strategies
which can be executed by a finite automaton. In [?] it is investigated whether
one can efficiently learn strategies for one- and two-player closed finite-state
games from membership and play queries, where membership queries involve
asking whether a certain position is already a loss, and play queries involve
asking whether a certain finite automaton implements a winning strategy. It
would be interesting to apply the master learning concept to this situation. Can
the time and query complexity of a learner be improved if the learner can ask
queries about a winning automaton?

In Section 3 we have stated natural examples witnessing the separation of
ArbMa$^{m+1}$ and ArbMa$^m$, and of SelectMa$^2$ and SelectMa$^1$. It would be
interesting to see whether further natural examples separate some other levels
of the SelectMa$^m$ hierarchy.
Abstract. In this paper we investigate in which cases unions of identifiable
classes of recursive functions are also necessarily identifiable. We
consider identification in the limit with bounds on mindchanges and
anomalies. Though not closed under the set union, these identification types
still have features resembling closedness. For each of them we find such
$n$ that

1) if every union of $n - 1$ classes out of $U_1, \ldots, U_n$ is identifiable, so is
the union of all $n$ classes;

2) there are such classes $U_1, \ldots, U_{n-1}$ that every union of $n - 2$ classes
out of them is identifiable, while the union of $n - 1$ classes is not.

We show that by finding these $n$ we can distinguish which requirements
put on the identifiability of unions of classes are satisfiable and which
are not. We also show how our problem is connected with team learning.



1 Introduction 

This paper considers a problem in inductive inference of recursive functions.
E. M. Gold in [?] introduced the paradigm of identification in the limit: the
identification strategy receives data on the object to be learned (a language, for
instance) in the input, and produces an infinite sequence of hypotheses
(characterizing this object) that must stabilize on some correct final value. In this
paper we will concentrate on identification of total recursive functions. Many
modifications to Gold's model of learning have been proposed, such as
prediction [?], behaviourally correct [?], probabilistic [?], and consistent identification
[?], co-learning [?], and identification of minimal Gödel numbers [?].

Each such modification introduces a new identification type. One of the first
questions that arises after introducing a new identification type is: "Is it closed
under the operation of set union?" I.e., is the class of functions $U_1 \cup U_2$
identifiable if classes $U_1$ and $U_2$ are identifiable? This problem is solved for most if
not for all of the known identification types. The first such result was proved by

* Supported by ^ NSF Grant 9301339 and ^ Latvia Science Council Grant 96.0282. 
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E. M. Gold: he showed that there are two languages that are identifiable in the 
limit, while their union is not El. A similar result for the case of total recursive 
functions was obtained independently by J. Barzdiiis in and by L. Blum and 
M. Blum in |^. 

After these results it seemed natural that, whatever requirements we put on
the identifiability of classes and their unions, there are classes that satisfy
these requirements. However, it has been shown that there are unsatisfiable
requirements as well. It turned out that EX nonetheless has a property much
resembling closedness: if all the unions $U_1 \cup U_2$, $U_1 \cup U_3$ and $U_2 \cup U_3$ are
identifiable, then $U_1 \cup U_2 \cup U_3$ is identifiable, too. We can formalize this property
as follows: we consider an identification type to be $n$-closed if for every $n$ classes
of recursive functions, if all the unions of $n - 1$ of these classes are identifiable, so
is the union of all $n$ classes. It turns out that to distinguish between satisfiable
and unsatisfiable sets of requirements we have to find the least $n$ for which the
identification type is $n$-closed. This problem has previously been solved for some cases of
identification in the limit modified by bounds on the number of anomalies
and on the number of mindchanges.

The purpose of this paper is to show the complete picture of n-closedness 
of identification in the limit with bounds on mindchanges and anomalies (these 
are the most often considered modifications of identification in the limit) and to 
solve the problem of satisfiability of requirements. 

Related papers deal with a similar problem in language learning and team
learning.

After the preliminaries in Sect. 2 we define $n$-closedness and point to its
connection with team learning in Sect. 3. In Sect. 4 we show how the satisfiability
of requirements problem depends on $n$-closedness properties. In Sect. 5 we solve
the $n$-closedness problem for the considered identification types. Sect. 6 contains
a summary of the results.

2 Preliminaries 

Any recursion-theoretic notation not explained below follows the standard literature. $\mathbb{N}$ denotes
the set of natural numbers, $\{0, 1, 2, \ldots\}$. $*$ denotes “an arbitrary finite (natural)
number.” In inequalities, $(\forall n \in \mathbb{N})[n < * < \infty]$. $\langle \cdot, \ldots, \cdot \rangle$ denotes a computable
one-to-one numbering of all the tuples of natural numbers.

Let $\mathcal{R}$ denote the set of total recursive functions of one argument and $\mathcal{P}$
the set of partial recursive functions of one argument. If $f(x)$ is undefined, we
write $f(x)\uparrow$. By $f(x)\downarrow = y$ we mean that $f(x)$ is defined and equal to $y$; $f(x)\downarrow$
means that $f(x)$ is defined. If $f, g \in \mathcal{P}$ and $a \in \mathbb{N} \cup \{*\}$, then $f =^a g$ means that
$\mathrm{card}(\{x \in \mathbb{N} \mid f(x) \ne g(x)\}) \le a$. These $a$ points of difference are called
anomalies. If $f \in \mathcal{R}$, $f^{[n]}$ denotes $\langle f(0), f(1), \ldots, f(n) \rangle$.

We fix a Gödel numbering of the partial recursive functions and denote
it by $\varphi$.

An identification strategy $F$ is an arbitrary partial recursive function. It
receives as input $f^{[n]}$, the initial segment of the target function $f \in \mathcal{R}$. We will
refer to its output as a hypothesis on the function $f$. A mindchange is
an event when $F(f^{[n]})$ and $F(f^{[n+1]})$ are both defined and different.



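Definition 1. Let $a, b \in \mathbb{N} \cup \{*\}$. A strategy $F$ $\mathrm{EX}_b^a$-identifies a
function $f \in \mathcal{R}$ ($f \in \mathrm{EX}_b^a(F)$) iff:

1. $(\exists N)[(\forall n < N)[F(f^{[n]})\uparrow] \wedge (\forall n \ge N)[F(f^{[n]})\downarrow]]$;

2. $(\exists h)[(\forall^{\infty} n)[F(f^{[n]})\downarrow = h] \wedge \varphi_h =^a f]$;

3. the number of mindchanges made by $F$ on $f$ does not exceed $b$.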
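
As a rough illustration of Definition 1, the following Python sketch simulates a strategy on finite initial segments and counts mindchanges. It is only a finite approximation under stated assumptions: hypotheses are represented by arbitrary Python values rather than Gödel numbers, `None` stands for an undefined output, and the names `run_strategy` and `trim_strategy` are hypothetical helpers, not part of the paper.

```python
def run_strategy(strategy, f, horizon):
    """Feed the initial segments (f(0), ..., f(n)), n < horizon, to `strategy`.

    Returns (last output, number of mindchanges). A mindchange is counted
    when two consecutive outputs are both defined (not None) and different.
    """
    mindchanges = 0
    last_output = None
    for n in range(horizon):
        segment = tuple(f(x) for x in range(n + 1))
        current = strategy(segment)
        if last_output is not None and current is not None and current != last_output:
            mindchanges += 1
        last_output = current
    return last_output, mindchanges


def trim_strategy(segment):
    """Toy strategy for functions with finite support (assumed class).

    The hypothesis is the segment without trailing zeros, read as
    "these values first, zero afterwards".
    """
    end = len(segment)
    while end > 0 and segment[end - 1] == 0:
        end -= 1
    return segment[:end]


def example_function(x):
    # f(0) = 3, f(2) = 5, zero elsewhere (illustrative data only)
    return {0: 3, 2: 5}.get(x, 0)
```

On `example_function` the toy strategy changes its mind once (after seeing the value at 2), so it would meet a bound $b = 1$ on this particular function, but no fixed bound works for the whole finite-support class.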



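Definition 2. A class $U \subseteq \mathcal{R}$ is $\mathrm{EX}_b^a$-identifiable ($U \in \mathrm{EX}_b^a$) iff
$(\exists F \in \mathcal{P})[U \subseteq \mathrm{EX}_b^a(F)]$.

The following relationship has been established between these identification
types.

Theorem 1. $(\forall a, b, c, d \in \mathbb{N} \cup \{*\})[\mathrm{EX}_b^a \subseteq \mathrm{EX}_d^c \iff a \le c \wedge b \le d]$.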

In general, we define an identification type by the following scheme. 

1. $I$-identification is defined as a mapping $I : \mathcal{M} \to P(\mathcal{R})$, where $\mathcal{M}$ is the set of
the subjects performing identification (in this paper, the set of strategies),
and $P(\mathcal{R})$ is the set of all the subsets of $\mathcal{R}$; $I(M)$ is the set of all the functions
identified by $M \in \mathcal{M}$;

2. a class of functions $U \subseteq \mathcal{R}$ is considered $I$-identifiable iff
$(\exists M \in \mathcal{M})[U \subseteq I(M)]$;

3. the identification type is characterized by the set $\mathcal{I} = \{U \subseteq \mathcal{R} \mid U$ is
$I$-identifiable$\}$.

3 n-Closedness 

Here we define n-closedness and list some of its properties. 

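Definition 3. An identification type $\mathcal{I}_1$ is $n$-closed in $\mathcal{I}_2$ ($n > 1$) iff

$$(\forall U_1, \ldots, U_n \in \mathcal{I}_1)\Big[(\forall i \mid 1 \le i \le n)\Big[\bigcup_{j=1,\, j \ne i}^{n} U_j \in \mathcal{I}_1\Big] \Rightarrow \bigcup_{j=1}^{n} U_j \in \mathcal{I}_2\Big].$$

Definition 4. An identification type $\mathcal{I}$ is $n$-closed ($n > 1$) iff $\mathcal{I}$ is $n$-closed
in $\mathcal{I}$.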

So “2-closed” is the same as “closed.” The following propositions can be 
easily proved by set-theoretical considerations. 
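
The definitions above can be explored by brute force in a finite toy model. The sketch below is an illustration only, not the paper's setting: it assumes a finite "universe" of functions (plain integers), and models an identification type by a list of sets, each being everything one hypothetical strategy identifies; a class is identifiable iff it is contained in one of these sets. The names `subsets`, `identifiable` and `is_n_closed` are made up for this example.

```python
from itertools import combinations, product


def subsets(universe):
    """All subsets of a finite universe, as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def identifiable(cls, strategies):
    """A class is identifiable iff some strategy identifies all of it."""
    return any(cls <= s for s in strategies)


def is_n_closed(universe, strategies, n):
    """Check Definition 3 with both types equal to the toy type."""
    candidates = [u for u in subsets(universe) if identifiable(u, strategies)]
    for classes in product(candidates, repeat=n):
        all_partial_unions_ok = all(
            identifiable(
                frozenset().union(*(classes[j] for j in range(n) if j != i)),
                strategies,
            )
            for i in range(n)
        )
        if all_partial_unions_ok and not identifiable(frozenset().union(*classes), strategies):
            return False
    return True


# Toy type: three functions, each strategy identifies any two of them.
UNIVERSE = frozenset({0, 1, 2})
STRATEGIES = [frozenset({0, 1}), frozenset({0, 2}), frozenset({1, 2})]
```

In this toy type the three singletons have identifiable pairwise unions but an unidentifiable total union, so the type is neither 2-closed nor 3-closed; it is 4-closed, because four classes would need four distinct "private" elements.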

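Proposition 1. If $\mathcal{I}_1$ is $n$-closed in $\mathcal{I}_2$, then $\mathcal{I}_1 \subseteq \mathcal{I}_2$.

Proposition 2. If $\mathcal{I}_2$ is $n$-closed in $\mathcal{I}_3$, $\mathcal{I}_1 \subseteq \mathcal{I}_2$ and $\mathcal{I}_3 \subseteq \mathcal{I}_4$, then $\mathcal{I}_1$ is $n$-closed
in $\mathcal{I}_4$.

Proposition 3. Let $\mathcal{I}_1$ be $n$-closed in $\mathcal{I}_2$. Then $\mathcal{I}_1$ is $m$-closed in $\mathcal{I}_2$ for all
$m > n$.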

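Proof. Suppose $\mathcal{I}_1$ is $n$-closed in $\mathcal{I}_2$, $m > n$. Suppose sets $U_1, \ldots, U_m \in \mathcal{I}_1$ satisfy
the property $(\forall i \mid 1 \le i \le m)[\bigcup_{j=1,\, j \ne i}^{m} U_j \in \mathcal{I}_1]$. Define $V_1 = U_1, \ldots, V_{n-1} =
U_{n-1}$, $V_n = \bigcup_{j=n}^{m} U_j$. We have $V_n \in \mathcal{I}_1$ because $V_n \subseteq \bigcup_{j=2}^{m} U_j \in \mathcal{I}_1$, and
$\bigcup_{j=1,\, j \ne i}^{n} V_j \in \mathcal{I}_1$ because $\bigcup_{j=1,\, j \ne i}^{n} V_j \subseteq \bigcup_{j=1,\, j \ne i}^{m} U_j \in \mathcal{I}_1$. Thus,

$$(\forall i \mid 1 \le i \le n)\Big[\bigcup_{j=1,\, j \ne i}^{n} V_j \in \mathcal{I}_1\Big].$$

Since $\mathcal{I}_1$ is $n$-closed in $\mathcal{I}_2$, $\bigcup_{j=1}^{n} V_j = \bigcup_{j=1}^{m} U_j \in \mathcal{I}_2$. □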

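The proposition shows that to characterize the $n$-closedness properties of $\mathcal{I}_1$
in $\mathcal{I}_2$ we need to find the minimal $n$ for which $\mathcal{I}_1$ is $n$-closed in $\mathcal{I}_2$.

Definition 5. We say that $n$ is the closedness degree of $\mathcal{I}_1$ in a superset $\mathcal{I}_2$
($n = \mathrm{csdeg}(\mathcal{I}_1, \mathcal{I}_2)$) iff $n$ is the smallest number such that $\mathcal{I}_1$ is $n$-closed in $\mathcal{I}_2$.
If such $n$ does not exist, we define $\mathrm{csdeg}(\mathcal{I}_1, \mathcal{I}_2) = \infty$.

We will call $\mathrm{cdeg}(\mathcal{I}) = \mathrm{csdeg}(\mathcal{I}, \mathcal{I})$ the closedness degree of $\mathcal{I}$.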

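From Proposition 2 and Theorem 1 we get:

Proposition 4. If $a_1 \le a_2$, $b_1 \le b_2$, $c_1 \le c_2$ and $d_1 \le d_2$, then

$$\mathrm{csdeg}(\mathrm{EX}_{b_2}^{a_2}, \mathrm{EX}_{d_1}^{c_1}) \ge \mathrm{csdeg}(\mathrm{EX}_{b_1}^{a_1}, \mathrm{EX}_{d_2}^{c_2}).$$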

It turns out that the problem of finding the closedness degree is equivalent 
to a problem in team learning. According to this model, many strategies parti- 
cipate in the identification, and we require only a certain amount of them to be 
successful. Team learning was suggested by Case and first investigated by Smith 
m- The general definition is due to m- 

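Definition 6. Let $\mathcal{I}$ be an identification type. $U \subseteq \mathcal{R}$ is $\mathcal{I}$-identifiable by a team
“$k$ out of $l$” (we write $U \in [k, l]\mathcal{I}$, $1 \le k \le l$) iff there is a “team” of $l$ strategies
such that every function from $U$ is $\mathcal{I}$-identified by at least $k$ of these strategies.

Proposition 5. $\mathcal{I}_1$ is $n$-closed in $\mathcal{I}_2$ iff $[n - 1, n]\mathcal{I}_1 \subseteq \mathcal{I}_2$.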

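Proof. Suppose $\mathcal{I}_1$ is $n$-closed in $\mathcal{I}_2$. Let $U \in [n - 1, n]\mathcal{I}_1$, and let $F_1, \ldots, F_n$ be
the team that $[n - 1, n]\mathcal{I}_1$-identifies $U$. We define $U_i = \{f \in U \mid (\forall j \ne i)[f \in
\mathcal{I}_1(F_j)]\}$. Clearly, $(\forall j \mid 1 \le j \le n)[\bigcup_{i=1,\, i \ne j}^{n} U_i \subseteq \mathcal{I}_1(F_j)]$. Since $\mathcal{I}_1$ is $n$-closed in
$\mathcal{I}_2$, $\bigcup_{i=1}^{n} U_i = U \in \mathcal{I}_2$.

Now, suppose $[n - 1, n]\mathcal{I}_1 \subseteq \mathcal{I}_2$. Let $U_1, \ldots, U_n$ be such sets that
$(\forall j \mid 1 \le j \le n)[\bigcup_{i=1,\, i \ne j}^{n} U_i \in \mathcal{I}_1]$. Let $F_j$ be the strategy that identifies $\bigcup_{i=1,\, i \ne j}^{n} U_i$. Then
the team $F_1, \ldots, F_n$ $[n - 1, n]\mathcal{I}_1$-identifies $\bigcup_{i=1}^{n} U_i$, so $\bigcup_{i=1}^{n} U_i \in \mathcal{I}_2$. Therefore,
$\mathcal{I}_1$ is $n$-closed in $\mathcal{I}_2$. □

Corollary 1. $\mathrm{cdeg}(\mathcal{I}) = n$ iff $n$ is the minimal number for which $[n - 1, n]\mathcal{I} = \mathcal{I}$.
$\mathrm{cdeg}(\mathcal{I}) = \infty$ iff for all $n \in \mathbb{N}$, $\mathcal{I} \subset [n - 1, n]\mathcal{I}$.

Corollary 2. $\mathrm{csdeg}(\mathcal{I}_1, \mathcal{I}_2) = n$ iff $n$ is the minimal number for which
$[n - 1, n]\mathcal{I}_1 \subseteq \mathcal{I}_2$. Otherwise $\mathrm{csdeg}(\mathcal{I}_1, \mathcal{I}_2) = \infty$.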
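
The link between team identification and $n$-closedness can be checked by brute force in a small finite model. This Python sketch is an illustration under simplifying assumptions: functions are integers, each hypothetical strategy is represented by the set of functions it identifies, and a team may repeat a strategy. The names `team_identifiable` and `team_type_contained` are invented for the example.

```python
from itertools import combinations, combinations_with_replacement


def subsets(universe):
    """All subsets of a finite universe, as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def identifiable(cls, strategies):
    """A class is identifiable iff one strategy identifies all of it."""
    return any(cls <= s for s in strategies)


def team_identifiable(cls, strategies, k, l):
    """True iff some team of l strategies identifies each function k times."""
    return any(
        all(sum(f in s for s in team) >= k for f in cls)
        for team in combinations_with_replacement(strategies, l)
    )


def team_type_contained(universe, strategies, n):
    """Check the inclusion [n-1, n]I in I over all subclasses of the universe."""
    return all(
        identifiable(cls, strategies)
        for cls in subsets(universe)
        if team_identifiable(cls, strategies, n - 1, n)
    )


# Toy type: three functions, each strategy identifies any two of them.
UNIVERSE = frozenset({0, 1, 2})
STRATEGIES = [frozenset({0, 1}), frozenset({0, 2}), frozenset({1, 2})]
```

Here the whole universe is identified by a team "2 out of 3" but by no single strategy, so the inclusion fails for $n = 3$ and holds for $n = 4$, in line with a closedness degree of 4 for this toy type.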

4 Satisfiability of Requirements 

Suppose we have a set of requirements on the $\mathrm{EX}_b^a$-identifiability of every union
of some classes out of $U_1, U_2, \ldots, U_k$. We want to find a simple criterion for
deciding whether this set of requirements is satisfiable.

A convenient way of expressing such requirements is to use Boolean
functions. We will write Boolean vectors in boldface and their components in
italics with indices. A vector $\mathbf{x} \in \{0, 1\}^k$ corresponds to the union $\bigcup_{x_i = 1} U_i$.
Let $f : \{0, 1\}^k \to \{0, 1\}$. If $f(\mathbf{x}) = 0$, we demand that the corresponding union is
identifiable. If $f(\mathbf{x}) = 1$, the corresponding union must be unidentifiable.

Definition 7. Let $a, b \in \mathbb{N} \cup \{*\}$. A Boolean function $f : \{0, 1\}^k \to \{0, 1\}$ is
$\mathrm{EX}_b^a$-satisfiable iff $(\exists U_1, \ldots, U_k \subseteq \mathcal{R})(\forall \mathbf{x} \in \{0, 1\}^k)[\bigcup_{x_i = 1} U_i \in \mathrm{EX}_b^a \iff f(\mathbf{x}) = 0]$.

Which of the properties of $\mathrm{EX}_b^a$ are relevant for the satisfiability of Boolean
functions? Two properties are immediate: $\mathrm{EX}_b^a$ contains the empty set and,
together with a set, $\mathrm{EX}_b^a$ contains all its subsets. It has been shown that another property
is relevant: the closedness degree. The following definition combines these three
restrictions.

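Definition 8. A Boolean function $f : \{0, 1\}^k \to \{0, 1\}$ is $n$-convolutional
iff

1. $f(\mathbf{0}) = 0$;

2. $(\forall \mathbf{x}, \mathbf{y} \in \{0, 1\}^k)[\mathbf{x} \le \mathbf{y} \Rightarrow f(\mathbf{x}) \le f(\mathbf{y})]$ (monotonicity);

3. $(\forall \mathbf{x} \in \{0, 1\}^k)(\forall i_1, \ldots, i_n \mid 1 \le i_1 < \cdots < i_n \le k \wedge x_{i_1} = \cdots = x_{i_n} = 1)[(\forall r \mid
1 \le r \le n)[f(x_1, \ldots, x_{i_r - 1}, 0, x_{i_r + 1}, \ldots, x_k) = 0] \Rightarrow f(\mathbf{x}) = 0]$.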
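
The three conditions of Definition 8 can be tested mechanically for small $k$. The Python sketch below is a direct brute-force check; the function name `is_n_convolutional` and the two example requirement functions are assumptions made for illustration, with Boolean vectors represented as tuples of 0 and 1.

```python
from itertools import combinations, product


def is_n_convolutional(f, k, n):
    """Brute-force check of the three conditions for f on {0,1}^k."""
    vectors = list(product((0, 1), repeat=k))
    # Condition 1: the empty union must be identifiable.
    if f((0,) * k) != 0:
        return False
    # Condition 2: monotonicity with respect to the componentwise order.
    for x in vectors:
        for y in vectors:
            if all(a <= b for a, b in zip(x, y)) and f(x) > f(y):
                return False
    # Condition 3: if zeroing each of n chosen 1-components gives 0, so must x.
    for x in vectors:
        ones = [i for i in range(k) if x[i] == 1]
        for chosen in combinations(ones, n):
            lowered_all_zero = all(f(x[:i] + (0,) + x[i + 1:]) == 0 for i in chosen)
            if lowered_all_zero and f(x) != 0:
                return False
    return True


def all_three(x):
    """Requirement: only the union of all three classes is unidentifiable."""
    return int(all(x))


def first_two(x):
    """Requirement: a union is unidentifiable iff it contains U_1 and U_2."""
    return int(x[0] == 1 and x[1] == 1)
```

For example, `all_three` is 4-convolutional but not 3-convolutional, which matches the earlier observation that for EX three classes with identifiable pairwise unions always have an identifiable total union.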

The next result shows that $n$-convolutionality is the criterion that we
desire.

Theorem 2. Let $a, b \in \mathbb{N} \cup \{*\}$. If $\mathrm{cdeg}(\mathrm{EX}_b^a) = n \in \mathbb{N}$, then a Boolean function
is $\mathrm{EX}_b^a$-satisfiable iff it is $n$-convolutional.

If $\mathrm{cdeg}(\mathrm{EX}_b^a) = \infty$, then a Boolean function $f$ is $\mathrm{EX}_b^a$-satisfiable iff $f(\mathbf{0}) = 0$
and $f$ is monotone.

Proof. First we prove necessity. Suppose a function $f : \{0, 1\}^k \to \{0, 1\}$
is $\mathrm{EX}_b^a$-satisfiable. Let $U_1, \ldots, U_k$ be the classes that satisfy the requirements.
Then, because of the mentioned properties of $\mathrm{EX}_b^a$, $f(\mathbf{0}) = 0$ and $f$ is monotone.
Suppose $\mathrm{cdeg}(\mathrm{EX}_b^a) = n \in \mathbb{N}$. Let $\mathbf{x}$ be an arbitrary vector from $\{0, 1\}^k$. Let
$i_1, \ldots, i_n$ be such that $1 \le i_1 < \cdots < i_n \le k$ and $x_{i_1} = \cdots = x_{i_n} = 1$. We define
$\mathbf{y}^j$, $1 \le j \le n$, to be such vectors that

1. $y^j_{i_j} = 1$,

2. $y^j_{i_r} = 0$ for $r \ne j$, $1 \le r \le n$,

3. $y^j_s = x_s$ for $s \in \{1, \ldots, k\} - \{i_1, \ldots, i_n\}$.

Let $V_j$ be the union of $U_1, \ldots, U_k$ corresponding to the vector $\mathbf{y}^j$. Then the
vectors $(x_1, \ldots, x_{i_r - 1}, 0, x_{i_r + 1}, \ldots, x_k)$, $1 \le r \le n$, correspond to the unions
of $n - 1$ classes out of $V_1, \ldots, V_n$. If these are $\mathrm{EX}_b^a$-identifiable, so is $\bigcup_{j=1}^{n} V_j$,
because $\mathrm{EX}_b^a$ is $n$-closed. Since $\bigcup_{j=1}^{n} V_j$ corresponds to the vector $\mathbf{x}$, we have
proved that $f$ is $n$-convolutional.

Now, sufficiency.

Definition 9. A vector $\mathbf{x}$ is a minimal 1-vector for a Boolean function $f$ iff

1. $f(\mathbf{x}) = 1$ and

2. $(\forall \mathbf{y} < \mathbf{x})[f(\mathbf{y}) = 0]$.
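
Minimal 1-vectors are easy to enumerate for small $k$. The following Python sketch is a brute-force illustration; the name `minimal_one_vectors` and the example requirement function are assumptions, and vectors are tuples compared componentwise.

```python
from itertools import product


def minimal_one_vectors(f, k):
    """Return all x in {0,1}^k with f(x) = 1 and f(y) = 0 for every y < x."""
    vectors = list(product((0, 1), repeat=k))

    def strictly_below(y, x):
        return y != x and all(a <= b for a, b in zip(y, x))

    return [
        x
        for x in vectors
        if f(x) == 1 and all(f(y) == 0 for y in vectors if strictly_below(y, x))
    ]


def example_requirement(x):
    """Unidentifiable iff the union contains U_3, or both U_1 and U_2."""
    return int((x[0] == 1 and x[1] == 1) or x[2] == 1)
```

For `example_requirement` the minimal 1-vectors have one and two components equal to 1, respectively.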



Let $\mathbf{x}^j$, $1 \le j \le t$, be all the minimal 1-vectors for $f$. Let $n_j$ be the number
of components in $\mathbf{x}^j$ that are equal to 1. Suppose that $\mathrm{cdeg}(\mathrm{EX}_b^a) = n \in \mathbb{N}$ and
$f$ is $n$-convolutional. According to point 3 in the definition of $n$-convolutionality,
$n_j < n$ for every $j \in \{1, \ldots, t\}$. Suppose $\mathrm{cdeg}(\mathrm{EX}_b^a) = \infty$, $f(\mathbf{0}) = 0$ and $f$ is
monotone. Then, trivially, every $n_j < \infty$.

So, in both cases $\mathrm{EX}_b^a$ is not $n_j$-closed, $j \in \{1, \ldots, t\}$, and there are
classes $U^j_1, \ldots, U^j_{n_j}$ such that every union of $n_j - 1$ out of them is identifiable, while
$\bigcup_{p=1}^{n_j} U^j_p$ is not.

Now we construct the classes $U_1, \ldots, U_k$ that satisfy the requirements given
by $f$. Suppose $x^j_i = 1$ for some $1 \le i \le k$ and $1 \le j \le t$, and suppose $x^j_i$ is the
$p$-th component of $\mathbf{x}^j$ that is equal to 1. Then for every function $f \in U^j_p$ we put
the function

$$f'(x) = \begin{cases} j, & x = 0 \\ f(x - 1), & x > 0 \end{cases}$$

in $U_i$. The class $U_i$ contains all the functions generated by this rule for different
values of $j$ and no more.
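
The tagging step of this construction is simple to state in code. The sketch below is illustrative only: functions are Python callables on the natural numbers, and the names `tag` and `untag` are hypothetical. It shows how the index $j$ is stored at argument 0 and how a strategy can recover both $j$ and the original function from the tagged one.

```python
def tag(j, f):
    """Return f' with f'(0) = j and f'(x) = f(x - 1) for x > 0."""
    return lambda x: j if x == 0 else f(x - 1)


def untag(g):
    """Recover the tag j = g(0) and the original function x -> g(x + 1)."""
    return g(0), (lambda x: g(x + 1))


def square(x):
    # Example target function (illustrative only)
    return x * x


tagged = tag(3, square)
recovered_tag, recovered = untag(tagged)
```

A strategy for a union can thus read the tag first and then delegate to the strategy that handles the classes built from the corresponding minimal 1-vector.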

Suppose $f(\mathbf{x}) = 1$. Then for some $j$, $\mathbf{x}^j \le \mathbf{x}$, and the corresponding union
contains the functions $f'$ we constructed from every function $f \in \bigcup_{p=1}^{n_j} U^j_p \notin
\mathrm{EX}_b^a$, so it is unidentifiable.

Suppose $f(\mathbf{x}) = 0$. We construct a strategy $F$ that identifies the
corresponding union. $F$ reads $f'(0) = j$ in the input. According to the monotonicity, there
is such $s$ that $x_s = 0$ and $x^j_s = 1$. Suppose $x^j_s$ is the $p$-th component equal to
1 in $\mathbf{x}^j$. Then, extracting $f(x) = f'(x + 1)$ from the input, we get a function $f$
that belongs to $\bigcup_{q=1,\, q \ne p}^{n_j} U^j_q$, which is $\mathrm{EX}_b^a$-identifiable. So $F$ can use the strategy
that identifies this class. □



Now, to solve the satisfiability problem for $\mathrm{EX}_b^a$, we only have to find the
closedness degrees of $\mathrm{EX}_b^a$.

5 Closedness Degrees

In this section we find the values of $\mathrm{cdeg}(\mathrm{EX}_b^a)$.

The first result in the whole area of the closedness of identification types for
total recursive functions was the next theorem.

Theorem 3. There are classes $U_1, U_2 \subseteq \mathcal{R}$ such that $U_1 \in \mathrm{EX}$, $U_2 \in \mathrm{EX}$,
and $U_1 \cup U_2 \notin \mathrm{EX}^*$.

So, $\mathrm{csdeg}(\mathrm{EX}, \mathrm{EX}^*) > 2$. Then, in team learning, the following result was
obtained.

Theorem 4. $(\forall a \in \mathbb{N} \cup \{*\})[[2, 3]\mathrm{EX}^a \subseteq \mathrm{EX}^a]$.

Using Proposition 4 and Corollaries 1 and 2 we get:

Theorem 5. $(\forall a \in \mathbb{N} \cup \{*\})[\mathrm{cdeg}(\mathrm{EX}^a) = 3]$.

Now we will consider the identification types $\mathrm{EX}_b$ and $\mathrm{EX}_b^*$, $b \in \mathbb{N}$. Theorem
6 is a generalization of a previously known result (Theorem 4.2 of an earlier paper).

Theorem 6. $(\forall b \in \mathbb{N})(\forall a, a' \in \mathbb{N} \cup \{*\} \mid a' \ge 2^{b+1} a)[\mathrm{csdeg}(\mathrm{EX}_b^a, \mathrm{EX}_b^{a'}) \le
2^{b+2}]$.

The proof of the theorem is based on a lemma.

Lemma 1. For all $b \in \mathbb{N}$, $a, a' \in \mathbb{N} \cup \{*\}$ such that $a' \ge 2^{b+1} a$, there is
an algorithm that can $\mathrm{EX}_b^{a'}$-identify any function $f \in \mathcal{R}$ knowing (receiving as
parameters) algorithms of $2^{b+2} - 1$ strategies such that each of them produces at
least one hypothesis on $f$ and at least $2^{b+2} - 2$ of them $\mathrm{EX}_b^a$-identify $f$.

Proof. Let strategies $F_1, F_2, \ldots, F_{2^{b+2}-1}$ and a function $f$ satisfy the conditions.
The algorithm $F$ redirects its input to the strategies $F_i$ until they output
hypotheses $h_i$, $i = 1, 2, \ldots, 2^{b+2} - 1$. Then $F$ produces a hypothesis $h$ such that
$\varphi_h(x) = y$ iff at least $2^{b+1}$ of the values $\varphi_{h_i}(x)$, $i = 1, 2, \ldots, 2^{b+2} - 1$, are $y$.

In case $b > 0$, $F$ waits for $2^{b+1} - 1$ of the strategies $F_i$ to make a mindchange.
Suppose it happens. Then, to $\mathrm{EX}_b^a$-identify $f$, these strategies can make no more
than $b - 1$ mindchanges from now on. So $F$ selects these $2^{b+1} - 1$ strategies,
disregards their hypotheses made before the mindchange and applies to them
the algorithm corresponding to the case of $\mathrm{EX}_{b-1}$-identification. This algorithm
identifies $f$ with no more than $b$ additional hypotheses and with no more than
$2^b a$ anomalies, so $f \in \mathrm{EX}_b^{a'}(F)$.

Suppose no more than $2^{b+1} - 2$ strategies make a mindchange or $b = 0$.
Then among $h_i$ there are no more than $2^{b+1} - 1$ hypotheses with more than $a$
anomalies, so $\varphi_h$ can have an anomaly only at the points where at least one of
the remaining $2^{b+1}$ hypotheses has an anomaly, that is at no more than $2^{b+1} a$
points.
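
The majority-vote hypothesis used in this proof can be sketched in Python. This is a simplified illustration, not the construction itself: hypotheses are modelled as total Python callables that return `None` where the underlying partial function would be undefined, whereas a real implementation would have to dovetail the computations of the $\varphi_{h_i}$. The name `majority_hypothesis` and the example data are assumptions.

```python
from collections import Counter


def majority_hypothesis(hypotheses, b):
    """Combine 2**(b+2) - 1 hypotheses by a vote with threshold 2**(b+1)."""
    if len(hypotheses) != 2 ** (b + 2) - 1:
        raise ValueError("expected 2**(b+2) - 1 hypotheses")
    threshold = 2 ** (b + 1)

    def combined(x):
        counts = Counter(h(x) for h in hypotheses)
        counts.pop(None, None)  # undefined values do not vote
        for value, count in counts.items():
            if count >= threshold:
                return value
        return None  # no value reaches the threshold: the result is undefined

    return combined


# Example for b = 0: three hypotheses about the identity function.
exact = lambda x: x
one_anomaly = lambda x: 0 if x == 1 else x
wrong = lambda x: 5
voted = majority_hypothesis([exact, one_anomaly, wrong], 0)
```

In the example the combined hypothesis is undefined only at the single point where one of the two good hypotheses errs, which is the effect the anomaly count in the proof relies on.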
Proof of Theorem 6. It is sufficient to prove that $\mathrm{EX}_b^a$ is $2^{b+2}$-closed in $\mathrm{EX}_b^{a'}$.

Let $U_1, U_2, \ldots, U_{2^{b+2}} \subseteq \mathcal{R}$ be such classes that all the unions of $2^{b+2} - 1$
classes out of them are $\mathrm{EX}_b^a$-identifiable. Let $F_1, F_2, \ldots, F_{2^{b+2}}$ be the strategies
that identify these unions. We will construct a strategy $F$ that $\mathrm{EX}_b^{a'}$-identifies
$\bigcup_{j=1}^{2^{b+2}} U_j$.

The strategy $F$ redirects its input to the strategies $F_i$ until $2^{b+2} - 1$ of
them output a hypothesis. Such an event happens because every function $f \in
\bigcup_{j=1}^{2^{b+2}} U_j$ belongs to at least $2^{b+2} - 1$ of the unions of $2^{b+2} - 1$ classes, thus at most
one of the strategies $F_i$ does not identify $f$.

Then $F$ selects these $2^{b+2} - 1$ strategies, applies the algorithm from the
previous lemma and identifies the input function. □

The next theorem is a generalization of Theorems 3.1 and 4.1 from earlier work.

Theorem 7. $(\forall b \in \mathbb{N})[\mathrm{csdeg}(\mathrm{EX}_b, \mathrm{EX}_b^*) > 2^{b+2} - 1]$.

We will use an idea whose origin is the concept of “self-describing” functions.
We will use functions that output instructions for $\mathrm{EX}_b^a$-identification
of themselves. Moreover, they will output many arrays of such
instructions. The instructions will be of three kinds.

1. An elementary instruction $\langle 1, j, i, n \rangle$, $i, j \ge 1$. Informally, it proposes $n$ as
the $i$-th hypothesis in the $j$-th array of instructions.

2. A compound instruction $\langle 2, y_1, \ldots, y_s \rangle$, where $y_i$ are elementary instructions.
In this way many elementary instructions can be incorporated in one value
output by a function.

3. A split instruction. It consists of two values, $\langle 3, i, y_1, y_2 \rangle$ and $\langle 4, i, y_3, y_4 \rangle$,
where $y_1 - y_2 + y_3 - y_4$ is an elementary or a compound instruction, and $i$ is
a unique identifier for this pair of values. In this way an instruction can be
split into two parts so that by changing any of these parts we can obtain a
different instruction.

Among the values $f(x)$ there must be exactly one value of kind $\langle 3, i, \cdot, \cdot \rangle$ and
exactly one value $\langle 4, i, \cdot, \cdot \rangle$ to get a split instruction with identifier $i$. Naturally,
other kinds of instructions can be designed to prove similar results for
identification types not considered in this work.

Let $\mathrm{Instr}(f)$ be the set of elementary instructions output by $f$, including
those that are contained in the compound and the split instructions.

Definition 10. We will say that a function $f \in \mathcal{R}$ is a $j$-instructor with respect
to $\mathrm{EX}_b^a$-identification ($a, b \in \mathbb{N} \cup \{*\}$) iff there is an instruction $\langle 1, j, c, n \rangle \in
\mathrm{Instr}(f)$ such that $\varphi_n =^a f$, $c \le b + 1$ and, if $\langle 1, j, c', n' \rangle \in \mathrm{Instr}(f)$ for some $c'$
and $n'$, then $c' < c$ or $n' = n$.

Let us denote the class of $j$-instructors with respect to $\mathrm{EX}_b^a$ by $I_j^{a,b}$. It is
easy to see that $I_j^{a,b} \in \mathrm{EX}_b^a$.
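
One way to see why $j$-instructors are identifiable is a strategy that simply follows the $j$-th array of instructions. The Python sketch below is a simplified illustration under explicit assumptions: tuple codes are represented by Python tuples instead of natural numbers, split instructions are left out, and the names `elementary_instructions` and `follow_array` are invented here. The strategy keeps the hypothesis of the highest-numbered instruction seen so far and changes it only when a strictly higher hypothesis number appears.

```python
def elementary_instructions(values):
    """Collect instructions (1, j, c, n) from elementary and compound values."""
    found = []
    for v in values:
        if isinstance(v, tuple) and v:
            if v[0] == 1 and len(v) == 4:
                found.append(v)
            elif v[0] == 2:
                found.extend(
                    y for y in v[1:]
                    if isinstance(y, tuple) and len(y) == 4 and y[0] == 1
                )
    return found


def follow_array(prefix, j):
    """Return (current hypothesis, list of hypotheses output so far) for array j."""
    best_c, current, history = 0, None, []
    for value in prefix:
        for (_, array_index, c, n) in elementary_instructions([value]):
            if array_index == j and c > best_c:
                best_c, current = c, n
                history.append(n)
    return current, history


# Illustrative prefix of function values (not taken from the paper).
PREFIX = [(1, 1, 1, 10), (), (2, (1, 1, 2, 11), (1, 2, 1, 20)), (1, 1, 2, 11)]
```

Since the hypothesis number of the decisive instruction is at most $b + 1$ and the sketch only moves to strictly larger hypothesis numbers, it outputs at most $b + 1$ hypotheses when the numbers start at 1, that is, it makes at most $b$ mindchanges.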
Proof of Theorem 7. Let us denote $k = 2^{b+2} - 1$. Define $U_i = \bigcap_{j \ne i} I_j^{0,b}$, where
$i, j \in [1, k]$. Then $\bigcup_{i \ne j} U_i \subseteq I_j^{0,b} \in \mathrm{EX}_b$.

We will prove that $\bigcup_{j=1}^{k} U_j \notin \mathrm{EX}_b^*$. Suppose there is a strategy $F$ that
identifies this union. The multiple recursion theorem lets us construct
functions that use each other's Gödel numbers as parameters. We construct functions
$\varphi_{n_i}$, one of which will be the function from $\bigcup_{i=1}^{k} U_i$ not identified by $F$.

The algorithm below uses a procedure new($x$). It lets $x \leftarrow n_c$, and then
$c \leftarrow c + 1$, where $c$ is a counter in the algorithm. The algorithm describing $\varphi_{n_i}$
is as follows.

- Stage 0.

Let $c = 1$, $j = 0$, $p = k$, $D = \{p\}$.

Execute new($s_i$) for $1 \le i \le p - 1$. Output values as shown in the next table.

                                              $0$                             $\ldots$   $p - 2$                                  $\ldots$
$\varphi_{s_1}, \ldots, \varphi_{s_{p-1}}$    $\langle 1, 1, 1, s_1 \rangle$   $\ldots$   $\langle 1, p-1, 1, s_{p-1} \rangle$    $\langle\rangle$

The leftmost column contains the functions defined, other columns show
values output at the corresponding inputs. The rightmost column means
that these values are output up to infinity unless the algorithm goes to the
next stage.

Let the variable $y$ throughout this algorithm indicate the maximal value of
argument at which the values have been output at the moment. We simulate
the strategy $F$ on the initial segments of $\varphi_{s_1}$. If a hypothesis is output on
$\varphi_{s_1}^{[x]}$ for some $x$, we let $h$ be this hypothesis and $x_0 = \max(x, y) + 1$; we output
$\langle\rangle$ up to $x_0 - 1$, if needed, and go to stage 1.

- Stage $m$ ($1 \le m \le b + 1$).

Let $r = \mathrm{card}(D)$, $l = (p - 1)/2$.

Let $d_1, \ldots, d_r$ be the elements of $D$. Execute new($t$), new($u_i$) for $1 \le i \le l - 1$,
new($t'$), new($u'_i$) for $1 \le i \le l - 1$. Output values as shown in the next table.

Functions                                                                                              $x_0 \ldots x_0 + r - 1$                                                   $x_0 + r \ldots x_0 + r + l - 2$                                                                  $\ldots$
$\varphi_{s_1}, \ldots, \varphi_{s_l}, \varphi_t, \varphi_{u_1}, \ldots, \varphi_{u_{l-1}}$              $\langle 1, d_1, m, t \rangle \ldots \langle 1, d_r, m, t \rangle$          $\langle 1, j+l+1, m+1, u_1 \rangle \ldots \langle 1, j+2l-1, m+1, u_{l-1} \rangle$              $\langle\rangle$
$\varphi_{s_{l+1}}, \ldots, \varphi_{s_{p-1}}, \varphi_{t'}, \varphi_{u'_1}, \ldots, \varphi_{u'_{l-1}}$  $\langle 1, d_1, m, t' \rangle \ldots \langle 1, d_r, m, t' \rangle$        $\langle 1, j+1, m+1, u'_1 \rangle \ldots \langle 1, j+l-1, m+1, u'_{l-1} \rangle$               $\langle 0 \rangle$

If $m = b + 1$, the algorithm remains in this stage forever.

If $m < b + 1$, we simulate $F$ on the functions $\varphi_{s_1}$ and $\varphi_{s_{l+1}}$.

If $F$ changes the current hypothesis $h$ on $\varphi_{s_1}^{[x]}$ for some $x$, we let $h$ be the new
hypothesis and $x_0 = \max(x, y) + 1$, output $\langle\rangle$ up to $x_0 - 1$, add $j + 1, \ldots, j + l, j + p - 1$ to
$D$, let $s_i = u_i$ for $1 \le i \le l - 1$, $j = j + l$, $p = l$ and go to stage $m + 1$.

If $F$ changes the current hypothesis $h$ on $\varphi_{s_{l+1}}^{[x]}$ for some $x$, we let $h$ be the new
hypothesis and $x_0 = \max(x, y) + 1$, output $\langle 0 \rangle$ up to $x_0 - 1$, add $j + l, \ldots, j + p - 1$
to $D$, let $s_i = u'_i$ for $1 \le i \le l - 1$, $p = l$ and go to stage $m + 1$.

Let us explain the meanings of variables at the start of stage $m$. $s_i$ are Gödel
numbers that have been proposed as the $m$-th hypotheses in the instructions.
The indices of these instructions begin with $j + 1$ and their number is $p - 1 =
2^{b+3-m} - 2$. $D$ contains the indices of the arrays of instructions for which the
$m$-th hypothesis has not been proposed yet.

At stage $m$ two alternatives represented by $\varphi_{s_1}$ and $\varphi_{s_{l+1}}$ are proposed for
$F$. Since they differ at infinitely many points, the last hypothesis $h$ cannot be
$\mathrm{EX}_b^*$-correct for both of them. If $F$ does not make a mindchange on any of the
two alternatives, the algorithm remains at stage $m$ forever, $\varphi_{s_1}, \varphi_{s_{l+1}} \in \bigcup_{i=1}^{k} U_i$,
and at least one of these two functions is not $\mathrm{EX}_b^*$-identified by $F$. If $F$ makes a
mindchange on one of these alternatives, the algorithm switches to stage $m + 1$,
choosing this alternative for further consideration. At stage $b + 1$ $F$ cannot output
a new hypothesis since it already has made $b$ mindchanges. So $F$ does not identify
the union. Contradiction. □



Corollary 3. $(\forall b \in \mathbb{N})[\mathrm{cdeg}(\mathrm{EX}_b) = \mathrm{cdeg}(\mathrm{EX}_b^*) = 2^{b+2}]$.

Lastly, we consider the case of $\mathrm{EX}_b^a$-identification, where $a, b \in \mathbb{N}$, $a > 0$.
The results turn out to be rather surprising. For $a = 1$, the closedness degree is
finite and still grows exponentially relative to $b$, while for $a \ge 2$ the closedness
degree is $\infty$.

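Theorem 8. $(\forall b \in \mathbb{N})[\mathrm{cdeg}(\mathrm{EX}_b^1) > (7 \cdot 6^{b+1} - 2)/5]$.

Proof. Let us denote $k = (7 \cdot 6^{b+1} - 2)/5$.

We define $U_i = \bigcap_{j \ne i} I_j^{1,b}$, $1 \le i \le k$. Then $\bigcup_{i \ne j} U_i \subseteq I_j^{1,b} \in \mathrm{EX}_b^1$,
$1 \le j \le k$.

We will prove that $\bigcup_{i=1}^{k} U_i \notin \mathrm{EX}_b^1$. Suppose $F$ is a strategy identifying this
union. We define functions $\varphi_{n_i}$ described by the following algorithm.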

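- Stage 0.

Let $c = 1$, $j = 0$, $p = (7 \cdot 6^{b+1} - 2)/5$, $D = \{p\}$. Execute new($s_i$) for
$1 \le i \le p - 1$. Output values as shown in the next table.

                                              $0$                             $\ldots$   $p - 2$                                  $\ldots$
$\varphi_{s_1}, \ldots, \varphi_{s_{p-1}}$    $\langle 1, 1, 1, s_1 \rangle$   $\ldots$   $\langle 1, p-1, 1, s_{p-1} \rangle$    $\langle\rangle$
--------------------------------------------------------------------------------------------------------------------------------------
$f_0$                                         $\langle 1, 1, 1, s_1 \rangle$   $\ldots$   $\langle 1, p-1, 1, s_{p-1} \rangle$    $\langle\rangle$

The function under the last horizontal line ($f_0$ in this case) is the function
not identified by $F$ in case the algorithm remains in this stage.

Let the variable $y$ throughout this algorithm indicate the maximal value of
argument at which the values have been output at the moment. We simulate
the strategy $F$ on the initial segments of $f_0$. If a hypothesis is output on
$f_0^{[x]}$ for some $x$, we let $h$ be this hypothesis and $x_0 = \max(x, y) + 1$; we output $\langle\rangle$ up to $x_0 - 1$, if
needed, and go to stage 1.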

- Stage $m$ ($1 \le m \le b + 1$).

Let $r = \mathrm{card}(D)$. Let $d_1, \ldots, d_r$ be the elements of $D$. Execute new($t_i$) for
$1 \le i \le r$. Go to substage 1.

- Substage 1.

Let $u = (p - 2)/2$, $y_1 = \langle 3, 2m - 1, 0, 0 \rangle$, $z_2 = \langle 2, \langle 1, d_1, m, t_1 \rangle, \ldots, \langle 1,
d_r, m, t_r \rangle \rangle$, $y_2 = \langle 4, 2m - 1, z_2, 0 \rangle$. Output values as shown in the next
table.





Xo Xo + 1 . . . 






? 


2/2 


0 


^ S-u + \ 1 ■ 


* ■ •> 


yi 


7 


0 


V’sp-l 




? 


? 


0 


Vh,--- 




yi 


d2 


0 


/ 7 m— 6 


yi 


2/2 


0 



The question marks mean that the values are not output at these points
as yet. We compute $\varphi_h(x_0)$, $\varphi_h(x_0 + 1)$ and the outputs of $F$ on $f_{7m-6}$.

If $m < b + 1$ and $F$ changes its current hypothesis on $f_{7m-6}^{[x]}$ for some $x$,
we assign $h$ the new hypothesis value, replace question marks with the
corresponding values of $f_{7m-6}$, let $x_0 = \max(x, y) + 1$, output $\langle\rangle$ up to
$x_0 - 1$, add $j + (p - 2)/6 + 1, \ldots, j + p - 1$ to $D$, let $p = (p - 2)/6$ and
go to stage $m + 1$.

If $\varphi_h(x_0) = y_1$, let $x_1 = y + 1$, and go to substage 2.

If $\varphi_h(x_0 + 1) = y_2$, let $x_1 = y + 1$, and go to substage 5.

- Substage 2.

Let $v = (p - 2) \cdot 2/3$, $w = (p - 2) \cdot 5/6$. Execute new($s'_i$) for $w + 1 \le i \le p - 3$.
Let $y_3 = \langle 3, 2m, 0, 0 \rangle$, $z_4 = \langle 2, \langle 1, j + w + 1, m + 1, s'_{w+1} \rangle, \ldots, \langle 1, j + p -
3, m + 1, s'_{p-3} \rangle \rangle$, $y_4 = \langle 4, 2m, z_4, 0 \rangle$. Output values as shown in the next
table.




Xo Xo + 1 




Xi Xi + 1 




I ■ ■ ■ 7 Vsu 1 Vsp-i 


? 


2/2 


0 


2/3 


2/4 


0 


‘dSu+l > • ■ • ) 


2/1 


2/2 


0 


? 


2/4 


0 


'-Psy + l , ■ ■ ■ , ‘pSp, 


2/1 


2/2 


0 


2/3 


? 


0 


P^w + 1 J • ' • 1 Psp-2 


2/1 


2/2 


0 


7 


7 


0 




2/1 


2/2 


0 


2/3 


2/4 


0 


Ps' • ■ • • ’ Vs' 

*u; + l P-3 


2/1 


2/2 


0 


2/3 


2/4 


0 


flm—b 


2/1 


2/2 


0 


2/3 


2/4 


0 



We compute $\varphi_h(x_1)$, $\varphi_h(x_1 + 1)$ and the outputs of $F$ on $f_{7m-5}$. If
$m < b + 1$ and $F$ outputs a new hypothesis on $f_{7m-5}^{[x]}$ for some $x$, we
assign $h$ the new hypothesis value, let $x_0 = \max(x, y) + 1$, output $\langle\rangle$ up
to $x_0 - 1$, add $j + 1, \ldots, j + w$, $j + p - 2$ and $j + p - 1$ to $D$, let $s_i = s'_{w+i}$
for $1 \le i \le (p - 2)/6 - 1$, let $j = j + w$, $p = (p - 2)/6$ and go to stage
$m + 1$.

If $\varphi_h(x_1) = y_3$, go to substage 3.

If $\varphi_h(x_1 + 1) = y_4$, go to substage 4.

- Substage 3. 

Execute new(ti) for 1 < i < r, new(sj) for u+I<z<w — I. Let 2/5 = 
{3,2m- l,{2,^{l,di,m,ti),. . . ,{l,dr,m,tr)\,Z 2 ), 2/6 = (3, 2m, (2, (I, j + 
w+ I, TO+ I, s„_|_i), . . . , (I, j + tu — I, m+ 1, 2 : 4 )- Output values as 

shown in the next table. 





XoXo + l 




Xi -I- 1 




Vsi , ■ • ■ , Psu 1 T’sp-l 


2/5 


2/2 


0 


2/3 


2/4 


0 




2/1 


2/2 


0 


2/6 


2/4 


0 


Psv + 1 ) • • • 5 T’Su, 


2/1 


2/2 


0 


2/3 


? 


0 




2/1 


2/2 


0 


2/6 


2/4 


0 


T’tl , ■ • ■ , 


2/5 


2/2 


0 


2/6 


2/4 


0 


T’y , ■ • • . 

*p + l *tu-l 


2/5 


2/2 


0 


2/6 


2/4 


0 


/ 7 m— 4 


2/5 


2/2 


0 


2/6 


2/4 


0 



Compute outputs of F on frm- 4 - m < b + 1 and F outputs a new 
hypothesis on for some x, we assign h the new hypothesis value, 

let Xq = max{x,y) + 1, output () up to Xq — 1, add j + 1, . . . , j + v, 
j + w, . . . ,j + p — 1 to D, let Si = for 1 < z < (p — 2)/6 — 1, let 
j = j + V, p = {p — 2 ) /6 and go to stage m + 1. 

- Substage 4 is similar to substage 3. 

- Substages 5, 6, 7 are similar to substages 2, 3, 4, respectively. 

End of stage m. 

$j$ in the algorithm is used as a base index for the arrays that have output their
$m$-th hypotheses ($s_i$) before stage $m$ was started. Note that the values are output so
that the corresponding function $f_i$ is a $q$-instructor for all $q \in \{1, \ldots, k\}$ except
one, so $f_i \in \bigcup_{j=1}^{k} U_j$. Note also that there is no way out of the substages 3, 4,
6 and 7 of stage $b + 1$. So the algorithm remains forever in some substage (or
stage 0), and, as is easy to see, the current hypothesis of $F$ has at least two
anomalies in comparison with the function $f_i$ corresponding to this substage
(mindchanges after the $b$-th mindchange made by $F$ are ignored). □



Theorem 9. $(\forall b \in \mathbb{N})[\mathrm{cdeg}(\mathrm{EX}_b^1) \le (7 \cdot 6^{b+1} + 3)/5]$.

Sketch of proof. Denote $k = (7 \cdot 6^{b+1} + 3)/5$, $l = (7 \cdot 6^{b} + 3)/5$.

Consider classes $U_1, \ldots, U_k$ such that the unions of $k - 1$ classes out of them
are $\mathrm{EX}_b^1$-identified by strategies $F_1, \ldots, F_k$. We will construct a strategy $F$
that identifies $\bigcup_{j=1}^{k} U_j$ using $F_1, \ldots, F_k$ as subroutines.

Denote the input function by $f$. Strategy $F$ simulates the strategies $F_1, \ldots,
F_k$ on $f$. $F$ waits until $k - 1$ strategies make their first hypotheses. Suppose
the strategies are $F_1, \ldots, F_{k-1}$, and their hypotheses are $h_1, \ldots, h_{k-1}$. Then $F$
outputs its own first hypothesis $h$ based on these strategies and their hypotheses.

Suppose $b > 0$ and $l - 1$ out of these $k - 1$ strategies output another hypothesis.
Then $F$ outputs its second hypothesis, based on these $l - 1$ strategies together
with their hypotheses, and we have reduced our problem to the case of
$\mathrm{EX}_{b-1}^1$-identification.

So it is enough to prove that, if no more than $l - 2$ strategies make another
hypothesis, or $b = 0$, then hypothesis $h$ is correct.

In this case there is at most one strategy among $F_1, \ldots, F_{k-1}$ that does
not identify $f$ and at most $l - 2$ strategies that identify $f$, but output another
hypothesis. So no more than $l - 1$ hypotheses among $h_1, \ldots, h_{k-1}$ are wrong.

Now we describe the algorithm for $\varphi_h$. It computes the following infinite table
and the hypotheses made by $F_i$ on all possible initial segments.

$0$                          $\ldots$   $n$                          $\ldots$
$\varphi_{h_1}(0)$           $\ldots$   $\varphi_{h_1}(n)$           $\ldots$
$\vdots$                                $\vdots$
$\varphi_{h_{k-1}}(0)$       $\ldots$   $\varphi_{h_{k-1}}(n)$       $\ldots$

Let the weight of a value in a column be the number of occurrences of this
value in the column. We will say that values $u$ and $v$ in different columns are
$p$-coordinated iff there are $p$ rows that have $u$ and $v$ in the corresponding columns.

The aim is to find a consistent interpretation of the table, that is, an initial
subtable, an $l_0 \le l$ and an initial segment $g^{[n]}$ such that $l_0 - 2$ of the strategies
$F_1, \ldots, F_{k-1}$ output the second hypothesis on a subsegment of $g^{[n]}$ and there
are at least $k - l_0$ rows in the subtable that have no more than one anomaly in
comparison with $g^{[n]}$. Such interpretations will be found for all but finitely many
$n$, because the initial segments of $f$ give consistent interpretations starting with
the segment on which the last of the second hypotheses is output.

When an interpretation is found, $\varphi_h$ outputs values (those that are not
already output) according to the following rules.

1 . Value u is output if its weight is at least and it is ^-coordinated with all 
the values already output. 

2. Value u is output if its weight is at least jg gq^al to the cor- 

responding value of g and it is ^-coordinated with all the values already 
output . 

3. Value Ui is output at point Xi if it is /-coordinated with all the values already 
output and there is a column X 2 such that: 

a) at point X 2 a value U 2 with weight at least has been output; 

b) there is another value V 2 yf U 2 in column X 2 such that the number of 
rows that have not u\ at x\ and have not V 2 at X 2 does not exceed / — 1. 

4. Suppose there are at least $2l - 1$ rows that have a guaranteed anomaly in a
fixed finite set of columns, and we can output values in these columns making
no more than one error. In such a situation the algorithm outputs these values,
and from then on outputs a value iff it is in at least $l$ of these $2l - 1$ rows (any
output according to the previous rules is terminated).

The proof of the correctness of the algorithm is rather long and technical, so
we omit it here. □

Corollary 4. $(\forall b \in \mathbb{N})[\mathrm{cdeg}(\mathrm{EX}_b^1) = (7 \cdot 6^{b+1} + 3)/5]$.

Theorem 10. $(\forall a \in \mathbb{N} \mid a > 1)(\forall b \in \mathbb{N})[\mathrm{cdeg}(\mathrm{EX}_b^a) = \infty]$.

The method of proof is similar to the one used in Theorems 7 and 8; we omit
it here.

6 Conclusion 

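The next table summarizes the obtained closedness degrees $\mathrm{cdeg}(\mathrm{EX}_b^a)$. Rows are
indexed by the bound $b$ on mindchanges, columns by the bound $a$ on anomalies.

$b \backslash a$   $0$          $1$                           $2$        $\ldots$   $*$
$0$                $4$          $9$                           $\infty$   $\infty$   $4$
$1$                $8$          $51$                          $\infty$   $\infty$   $8$
$2$                $16$         $303$                         $\infty$   $\infty$   $16$
$\ldots$                                                      $\infty$   $\infty$
$n$                $2^{n+2}$    $(7 \cdot 6^{n+1} + 3)/5$     $\infty$   $\infty$   $2^{n+2}$
$\ldots$                                                      $\infty$   $\infty$
$*$                $3$          $3$                           $3$        $3$        $3$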
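
The closedness degrees collected in this section can be packed into one small function. The Python sketch below only restates the results of Theorem 5, Corollaries 3 and 4 and Theorem 10; the function name `cdeg_ex` and the convention of passing the string `"*"` for an unbounded parameter are assumptions of this example.

```python
import math


def cdeg_ex(a, b):
    """Closedness degree of EX with at most a anomalies and b mindchanges.

    a and b are non-negative integers or the string "*".
    """
    if b == "*":
        return 3                                # Theorem 5
    if a == 0 or a == "*":
        return 2 ** (b + 2)                     # Corollary 3
    if a == 1:
        return (7 * 6 ** (b + 1) + 3) // 5      # Corollary 4
    return math.inf                             # Theorem 10 (a >= 2)
```

For instance, `cdeg_ex(1, 1)` evaluates $(7 \cdot 36 + 3)/5$, and together with Theorem 2 the returned value tells which Boolean requirement functions are satisfiable for the given identification type.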



More interesting than finding the closedness degrees for other identification
types (such as $\mathrm{BC}^a$, $\mathrm{CONS}^a$, $[k, l]\mathrm{EX}_b^a$, etc.) is the question: for which
identification types is the cdeg finite? How does the cdeg value affect the hierarchy of
success ratios $k/l$ that yield classes $[k, l]\mathrm{EX}_b^a$ that are different in their learning
power? And what is this hierarchy in the cases when cdeg is infinite? It seems
that this hierarchy is not well ordered, unlike the cases that have been
investigated so far. Since $n$-closedness uncovers structural aspects of the
identification types, we feel that further research in this direction is needed.
Abstract. Exact learning of half-spaces over finite subsets of $\mathbb{R}^n$ from
membership queries is considered. We describe the minimum set of
labelled examples separating the target concept from all the other ones
of the concept class under consideration. For a domain consisting of all
integer points of some polytope we give non-trivial lower bounds on the
complexity of exact identification of half-spaces. These bounds are near
to known upper bounds.

1 Introduction 

We consider the complexity of exact identification of half-spaces over the domain
$M$ that is an arbitrary finite subset of $\mathbb{R}^n$ ($n$ is fixed). We are interested in the
model of learning with membership queries.

The main result of this paper is a theorem describing the structure of the
teaching set $T$ of a half-space $c$, i.e. a subset of $M$ such that no other half-space
agrees with $c$ on the whole $T$.

The mentioned theorem is used to obtain the lower bound for the complexity 
of identification of half-spaces over the domain {0,1,. ..,fc — 1}". We show that 
MEMB(HS^) = 12(log”~^ k). For n > 3 this significantly improves f2(log k) lower 
bound m on the considered quantity. The presented result can be compared 
with the following upper bound. From results of M. Yu. Moshkov in the test 
theory CH it follows that 



(see [?]). We remark that for any fixed $n$ there is a learning algorithm that 
requires $O(\log^n k)$ membership queries and running time polynomial in $\log k$. 
This algorithm was proposed in [?]. 

When $M$ is the set of all integer points of some polytope we give a lower 
bound for the complexity $\mathrm{MEMB}(\mathrm{HS}(M))$. We show that for any fixed $n$ and 
$l > n$ and for any $\gamma$ there is a polytope $P \subset \mathbb{R}^n$ described by a system of $l$ linear 
inequalities with integer coefficients not exceeding $\gamma$ in absolute value such that 
$\mathrm{MEMB}(\mathrm{HS}(P \cap \mathbb{Z}^n)) = \Omega(l^{\lfloor n/2 \rfloor} \log^{n-1} \gamma)$. We remark that this bound is close 
to an upper bound obtained in the threshold function deciphering formalism: an 
algorithm that learns a half-space over $P \cap \mathbb{Z}^n$ in time bounded polynomially 

M.M. Richter et al. (Eds.): ALT’98, LNAI 1501, pp. 61-^3 1998. 

(c) Springer- Verlag Berlin Heidelberg 1998 
in $l$ and $\log \gamma$ using $O(l^{\lfloor n/2 \rfloor} \log^n \gamma)$ membership queries was proposed in [?] ($n$ 
is fixed). 

For some other related results, see Sect. 6. 



2 Preliminaries 

Let $M$ be an arbitrary finite non-empty subset of $\mathbb{R}^n$. $M$ is considered as an 
instance space. A concept over $M$ is a subset of $M$. A concept class is some 
non-empty collection of concepts over $M$. The concept $c \subseteq M$ is called a half-space 
over $M$ if there exist real numbers $a_0, a_1, \dots, a_n$ such that 

$$c = \Bigl\{ x \in M \;\Big|\; \sum_{j=1}^{n} a_j x_j \le a_0 \Bigr\}. \tag{1}$$

The inequality in (1) is called a threshold inequality for $c$. Denote by $\mathrm{HS}(M)$ the 
set of all half-spaces over $M$. Define $\mathrm{HS}_k^n = \mathrm{HS}(E_k^n)$ where $E_k = \{0, 1, \dots, k-1\}$. 
Each half-space over $M$ is a concept. The class $\mathrm{HS}(M)$ is a concept class. 

We consider the model of exact learning [?] with membership queries. 
The goal of the learner is to identify an unknown target concept $c$ chosen from 
a known concept class $C$, making membership queries (“Is $x \in c$?” for some 
$x \in M$) and receiving yes/no answers. The complexity of a learning algorithm for 
$C$ is the maximum number of queries it makes, over all possible target concepts 
$c \in C$. The complexity $\mathrm{MEMB}(C)$ of a concept class $C$ is the minimum learning 
complexity, over all learning algorithms for this class. A set $T \subseteq M$ is said to be 
a teaching set for a concept $c \in C$ with respect to the class $C$ if no other concept 
from $C$ agrees with $c$ on the whole of $T$. If a teaching set is of minimum cardinality, 
over all teaching sets for a concept $c$, then we call it a minimum teaching set for $c$. 
Denote by $\mathrm{TD}(c, C)$ the cardinality of a minimum teaching set for a concept $c$. 
$\mathrm{TD}(C)$ is the maximum of $\mathrm{TD}(c, C)$ over all concepts $c$ in $C$; it is called the teaching 
dimension of the class $C$. It is clear that $\mathrm{MEMB}(C) \ge \mathrm{TD}(C)$ (cf. [?]). 
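The definitions of teaching set and teaching dimension can be checked by brute force on very small classes. The following is a minimal sketch, not the authors' method: the function names and the toy class of one-sided thresholds over $\{0, \dots, 4\}$ are assumptions introduced only for illustration, and the search is exponential in the domain size.

```python
from itertools import combinations

def is_teaching_set(T, c, concept_class):
    """True if no other concept in the class agrees with c on every point of T."""
    return all(g == c or any((x in g) != (x in c) for x in T)
               for g in concept_class)

def minimum_teaching_set(c, concept_class, domain):
    """Smallest teaching set for c, found by trying subsets in order of size."""
    for size in range(len(domain) + 1):
        for T in combinations(domain, size):
            if is_teaching_set(T, c, concept_class):
                return set(T)
    raise ValueError("c cannot be distinguished within the class")

def teaching_dimension(concept_class, domain):
    """TD(C): the largest minimum-teaching-set size over all concepts."""
    return max(len(minimum_teaching_set(c, concept_class, domain))
               for c in concept_class)

# Hypothetical toy class: c_t = {x in M : x <= t} for t = -1, 0, ..., 4.
M = list(range(5))
thresholds = [frozenset(x for x in M if x <= t) for t in range(-1, 5)]

c2 = thresholds[3]  # the concept {0, 1, 2}
print(minimum_teaching_set(c2, thresholds, M))
print(teaching_dimension(thresholds, M))
```

For this toy class the boundary pair $\{t, t+1\}$ is the only way to pin down an interior threshold, which is the one-dimensional analogue of the boundary structure studied below.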

Let $\mathrm{Conv}(A)$ be the convex hull of $A \subseteq \mathbb{R}^n$; $\mathrm{Affdim}(A)$ is the affine 
dimension of $A$. For a concept $c \subseteq M$ denote by $N_0(c)$ (resp. $N_1(c)$) the set of vertices 
of $\mathrm{Conv}(c)$ (resp. $\mathrm{Conv}(M \setminus c)$). Denote $P_\nu(c) = \mathrm{Conv}\, N_\nu(c)$ ($\nu = 0, 1$). 



3 Auxiliary Results 

We first remark that a concept $c$ over the domain $M$ belongs to $\mathrm{HS}(M)$ if and 
only if $P_0(c) \cap P_1(c) = \emptyset$. Indeed, the necessity is evident and the sufficiency 
follows from the Separating Hyperplane Theorem (see [5]). 

Associated with each half-space $c$ over $M$ is the cone $K(c)$ of separating 
functionals $a = (a_0, a_1, \dots, a_n, a_{n+1})$ in an $(n+2)$-dimensional vector space 
[?]; $K(c)$ is described by the conditions 

$$\begin{cases}
\sum_{j=1}^{n} a_j x_j \le a_0 & \text{for each } x \in c, \\
\sum_{j=1}^{n} a_j x_j \ge a_0 + a_{n+1} & \text{for each } x \in M \setminus c, \\
a_{n+1} \ge 0 .
\end{cases} \tag{2}$$

Any solution $(a_0, \dots, a_{n+1})$ of this system with $a_{n+1} > 0$ defines a threshold 
inequality for $c$. The converse is also true: the coefficients $(a_0, \dots, a_n)$ of any 
threshold inequality of $c$ satisfy the system (2) for some positive value of $a_{n+1}$. 
For any $T_0 \subseteq c$, $T_1 \subseteq M \setminus c$ we consider the following subsystem of (2): 



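$$\begin{cases}
\sum_{j=1}^{n} a_j x_j \le a_0 & \text{for each } x \in T_0, \\
\sum_{j=1}^{n} a_j x_j \ge a_0 + a_{n+1} & \text{for each } x \in T_1, \\
a_{n+1} \ge 0 .
\end{cases} \tag{3}$$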



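Denote by $K(T_0, T_1)$ the cone consisting of its solutions. The set 

$$K^*(T_0, T_1) = \Bigl\{ \sum_{x \in T_0} \lambda_x \begin{pmatrix} 1 \\ -x \\ 0 \end{pmatrix} + \sum_{x \in T_1} \lambda_x \begin{pmatrix} -1 \\ x \\ -1 \end{pmatrix} + \eta \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \;\Big|\; \lambda_x \ge 0,\ \eta \ge 0 \Bigr\}$$

is a cone, dual to $K(T_0, T_1)$. A cone is said to be pointed if it does not contain 
non-zero subspaces. 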



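Lemma 1. For any $T_0 \subseteq c$, $T_1 \subseteq M \setminus c$ the cone $K^*(T_0, T_1)$ is pointed. 

Proof. Consider any representation of $0 \in K^*(T_0, T_1)$: for some non-negative $\eta$ and 
$\lambda_x$ ($x \in T_0 \cup T_1$) we have that 
$0 = \sum_{x \in T_0} \lambda_x \cdot (1, -x, 0) + \sum_{x \in T_1} \lambda_x \cdot (-1, x, -1) + \eta \cdot (0, 0, 1)$; consequently, 
$\sum_{x \in T_0} \lambda_x = \sum_{x \in T_1} \lambda_x = \eta$. If $\eta = 0$ then for any $x \in T_0 \cup T_1$ it holds that $\lambda_x = 0$, 
hence $K^*(T_0, T_1)$ is a pointed cone. If $\eta \ne 0$ then the point 
$y = \frac{1}{\eta} \sum_{x \in T_0} \lambda_x x = \frac{1}{\eta} \sum_{x \in T_1} \lambda_x x$ evidently belongs to $P_0 \cap P_1$, which is impossible. □ 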



Lemma 2. For any $c \in \mathrm{HS}(M)$ the dimension of $K(c)$ is $n + 2$. 

Proof. It is known that a cone $K$ has full dimension if and only if the 
dual cone $K^*$ is pointed. Since $K(c) = K(c, M \setminus c)$, the assertion follows from 
Lemma 1. □ 



Lemma 3. If $\mathrm{Affdim}\, M = n$ then for any $c \in \mathrm{HS}(M)$ the cone $K(c)$ is pointed. 

Proof. It is sufficient to verify that if $a = (a_0, a_1, \dots, a_n, a_{n+1}) \in K(c)$ and 
$-a \in K(c)$ then $a = 0$. From the system (2) we get that in this case $a_{n+1} = 0$ 
and, consequently, 

$$M \subseteq \Bigl\{ (x_1, x_2, \dots, x_n) \;\Big|\; \sum_{j=1}^{n} a_j x_j = a_0 \Bigr\} .$$

Since the dimension of $M$ is $n$, all $a_i$ ($i = 0, \dots, n$) are zeroes. □ 

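Now the following is a consequence of the theory of linear inequalities [?]. 

Lemma 4. If $\mathrm{Affdim}\, M = n$ then for every $c \in \mathrm{HS}(M)$ 

1) the cone $K(c)$ has a generating system (the system of extreme rays), unique 
up to positive factors, 

$$\bigl\{ b^{(i)} = (b_0^{(i)}, b_1^{(i)}, \dots, b_{n+1}^{(i)}),\ i = 1, \dots, s \bigr\} ; \tag{4}$$

2) there are unique sets $T_0(c) \subseteq c$, $T_1(c) \subseteq M \setminus c$ such that (2) is equivalent 
to the system 

$$\begin{cases}
\sum_{j=1}^{n} a_j x_j \le a_0 & \text{for each } (x_1, \dots, x_n) \in T_0(c), \\
\sum_{j=1}^{n} a_j x_j \ge a_0 + a_{n+1} & \text{for each } (x_1, \dots, x_n) \in T_1(c), \\
a_{n+1} \ge 0
\end{cases} \tag{5}$$

and no proper subsystem of (5) is equivalent to the system (2); 

3) for any $x = (x_1, \dots, x_n) \in T_0(c)$ there is a subset $I \subseteq \{1, \dots, s\}$ such that 
$|I| = n + 1$, the system $\{b^{(i)}, i \in I\}$ is linearly independent and 

$$\sum_{j=1}^{n} b_j^{(i)} x_j = b_0^{(i)} \ (i \in I), \qquad \sum_{i \in I} b_{n+1}^{(i)} > 0 ; \tag{6}$$

4) for any $x = (x_1, \dots, x_n) \in T_1(c)$ there is a subset $I \subseteq \{1, \dots, s\}$ such that 
$|I| = n + 1$, the system $\{b^{(i)}, i \in I\}$ is linearly independent and 

$$\sum_{j=1}^{n} b_j^{(i)} x_j = b_0^{(i)} + b_{n+1}^{(i)} \ (i \in I), \qquad \sum_{i \in I} b_{n+1}^{(i)} > 0 .$$

□ 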

There is a standard method to reduce the problem with $\mathrm{Affdim}\, M < n$ 
to the case of full dimension. Let $M \subseteq \mathbb{Q}^n$. Denote by $\mathrm{Aff}\, M$ the affine hull of 
$M$. Suppose that $\mathrm{Aff}\, M = \{x \in \mathbb{R}^n \mid Ax = b\}$ for some $A \in \mathbb{Z}^{m \times n}$. Let $D$ be the 
Smith normal form of $A$, and let $P$ and $Q$ be unimodular matrices 
such that $PAQ = D$. Without loss of generality we can take $D = (I_m, 0)$ 
where $I_m$ is the identity $m \times m$ matrix and $0$ is the zero $m \times (n - m)$ matrix. Perform 
the change of variables $x = Qy$ mapping $\mathbb{Z}^n$ into $\mathbb{Z}^n$. We have that $PAx = Dy$, 
that is, $\mathrm{Aff}\, M$ is described by the conditions $y' = Pb$ where $y' = (y_1, \dots, y_m)$. 
Thus, rewriting the remaining conditions in the variables $y'' = (y_{m+1}, \dots, y_n)$ we get 
the problem in $\mathbb{R}^{n-m}$ with $\mathrm{Affdim}\, M = n - m$. We remark that there exist $P$, $Q$ 
such that the maximal absolute value of a coefficient in the new problem does not 
exceed some polynomial in the maximal coefficient of the old problem (see, for 
example, [12]). 

4 Characterization of Teaching Sets of Half-Spaces 

Theorem 1. Let $T_0 \subseteq c$, $T_1 \subseteq M \setminus c$. $T = T_0 \cup T_1$ is a teaching set for a 
half-space $c$ if and only if (3) is equivalent to (2). 

Proof. The sufficiency of the conditions is evident. We prove their necessity. 
Assume that there is a solution $b = (b_0, b_1, \dots, b_n, b_{n+1})$ of (3) that does not 
belong to $K(c)$. By Lemma 2 the cone $K(T_0, T_1) \supseteq K(c)$ has full dimension, so 
we can suppose that $b_{n+1} > 0$. The threshold inequality 
$\sum_{j=1}^{n} b_j x_j \le b_0$ defines some concept $g \in \mathrm{HS}(M)$. We have that $b \notin K(c)$, 
thus $g \ne c$. But $g$ agrees with $c$ on $T$. Hence $T$ is not a teaching set. □ 
This theorem leads to 

Corollary 1. Let $T_0 \subseteq c$, $T_1 \subseteq M \setminus c$; then for any $c \in \mathrm{HS}(M)$ the set $T = 
T_0 \cup T_1$ is a minimum teaching set if and only if $T_\nu = T_\nu(c)$ ($\nu = 0, 1$). □ 

We note that the 2nd assertion of Lemma 4 is true for any $M \subseteq \mathbb{R}^n$, also 
when $\mathrm{Affdim}\, M < n$. By Corollary 1 we now get 

Corollary 2. For any $c \in \mathrm{HS}(M)$ there is a unique minimum teaching set. It 
is contained in every teaching set of $c$. □ 

Denote by $T(c) = T_0(c) \cup T_1(c)$ the minimum teaching set for $c$. 

Corollary 3. (Cf. [?].) For any $c \in \mathrm{HS}(M)$ it holds that $T(c) \subseteq N_0(c) \cup N_1(c)$. 

Proof. It is obvious that for $T_\nu = N_\nu(c)$ the system (3) is equivalent to the 
system (2). The assertion of the corollary now follows from Theorem 1. □ 

Let $\mathrm{Affdim}\, M = n$ and $c \in \mathrm{HS}(M)$. Without loss of generality we can assume 
that in (4) it holds that $b_{n+1}^{(i)} > 0$ for any $i = 1, \dots, \mu$ and $b_{n+1}^{(i)} = 0$ for any 
$i = \mu + 1, \dots, s$. Let $a = (a_1, \dots, a_n)$, 

$$M_0(c, a) = \Bigl\{ (y_1, \dots, y_n) \in M \;\Big|\; \sum_{j=1}^{n} a_j y_j = \max_{x \in c} \sum_{j=1}^{n} a_j x_j \Bigr\} ,$$

$$M_1(c, a) = \Bigl\{ (y_1, \dots, y_n) \in M \;\Big|\; \sum_{j=1}^{n} a_j y_j = \min_{x \in M \setminus c} \sum_{j=1}^{n} a_j x_j \Bigr\} .$$

Denote by $N_\nu(c, a)$ the set of vertices of the convex hull of $M_\nu(c, a)$. 

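Theorem 2. If $\mathrm{Affdim}\, M = n$ then for any $c \in \mathrm{HS}(M)$ it holds that 

$$T(c) = \bigcup_{i=1}^{\mu} \bigl( N_0(c, b^{(i)}) \cup N_1(c, b^{(i)}) \bigr) = \bigcup_{a} \bigl( N_0(c, a) \cup N_1(c, a) \bigr) ,$$

where in the right-hand side the union is over all $a = (a_1, \dots, a_n) \in \mathbb{R}^n$ such that the 
inequality 

$$\sum_{j=1}^{n} a_j x_j \le \max_{x \in c} \sum_{j=1}^{n} a_j x_j$$

is a threshold inequality for $c$. 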

Proof. First we prove the inclusion $T(c) \subseteq \bigcup_{i=1}^{\mu} \bigl( N_0(c, b^{(i)}) \cup N_1(c, b^{(i)}) \bigr)$. Let 
$y = (y_1, \dots, y_n) \in T_0(c)$. By the 3rd assertion of Lemma 4 there is $i \in \{1, \dots, \mu\}$ 
such that $\sum_{j=1}^{n} b_j^{(i)} y_j = b_0^{(i)}$. Since $b_{n+1}^{(i)} > 0$, the coefficients $b_j^{(i)}$ ($j = 0, 1, \dots, n$) are 
the coefficients of a threshold inequality for $c$ and $\max_{x \in c} \sum_{j=1}^{n} x_j b_j^{(i)} = b_0^{(i)}$. It follows 
from this that $y \in M_0(c, b^{(i)})$. Assume that $y \notin N_0(c, b^{(i)})$, i.e. $y = \sum_{q=1}^{p} \alpha_q y^{(q)}$ 
for some $p > 1$, $\alpha_q > 0$, $\sum_{q=1}^{p} \alpha_q = 1$, $y \ne y^{(q)} \in M_0(c, b^{(i)})$ ($q = 1, \dots, p$). 
Then $y \notin N_0(c)$ and, by Corollary 3, $y \notin T_0(c)$. This contradiction shows that 
$y \in N_0(c, b^{(i)})$. The case $y \in T_1(c)$ is proved similarly by the 4th assertion of 
Lemma 4. 

We now prove that $\bigcup_{a} \bigl( N_0(c, a) \cup N_1(c, a) \bigr) \subseteq T(c)$. Let $a = (a_1, \dots, a_n) \in \mathbb{R}^n$ 
and $a_0 = \max_{x \in c} \sum_{j=1}^{n} a_j x_j$, so that $\sum_{j=1}^{n} a_j x_j \le a_0$ is a threshold inequality for $c$. For any 
point $z \in N_0(c, a)$ we consider the concept $g = c \setminus \{z\}$. Let us prove that $g \in 
\mathrm{HS}(M)$. Assume the contrary; then $P_0(g) \cap P_1(g) \ne \emptyset$. This means that there 
are points $x^{(1)}, \dots, x^{(p)}$ in $g$, points $y^{(0)}, \dots, y^{(q)}$ in $M \setminus g$, and positive numbers 
$\alpha_1, \dots, \alpha_p$, $\beta_0, \dots, \beta_q$ such that 

$$x = (x_1, \dots, x_n) = \sum_{r=1}^{p} \alpha_r x^{(r)} = \sum_{t=0}^{q} \beta_t y^{(t)} , \tag{7}$$

$\sum_{r=1}^{p} \alpha_r = 1$, $\sum_{t=0}^{q} \beta_t = 1$, where $x \in P_0(g) \cap P_1(g)$. It is clear that among $y^{(0)}, \dots, y^{(q)}$ 
there is the point $z$, since otherwise we obtain that $P_0(c) \cap P_1(c) \ne \emptyset$, which is 



Lower Bounds for the Complexity of Learning Half-Spaces 



67 



impossible, because $c \in \mathrm{HS}(M)$. Let $z = y^{(0)}$. We have that 

$$\sum_{j=1}^{n} a_j x_j = \sum_{r=1}^{p} \alpha_r \sum_{j=1}^{n} a_j x_j^{(r)} = \sum_{t=1}^{q} \beta_t \sum_{j=1}^{n} a_j y_j^{(t)} + \beta_0 \sum_{j=1}^{n} a_j z_j .$$

In the last formula the central part does not exceed $a_0$; in the right-hand side the 
first addend is greater than $(1 - \beta_0) a_0$ whenever $q \ge 1$, and the second one is equal 
to $\beta_0 a_0$. For the equality it is necessary that $q = 0$ and $\sum_{j=1}^{n} a_j x_j^{(r)} = a_0$ 
($r = 1, \dots, p$). Thus $\beta_0 = 1$, $z = x$. From (7) we 
now obtain that $z \notin N_0(c, a)$, which contradicts the condition. Hence $g \in \mathrm{HS}(M)$. 
Since $c$ and $g$ differ only at one point, we have that $z \in T(c)$. 

Suppose now that $a = (a_1, \dots, a_n) \in \mathbb{R}^n$, $a_0 = \min_{x \in M \setminus c} \sum_{j=1}^{n} a_j x_j$. The inequality 
$\sum_{j=1}^{n} a_j x_j \ge a_0$ is true for any point in $M \setminus c$ and it is false for any point in $c$. 
For each $z \in N_1(c, a)$ we define the concept $g = c \cup \{z\}$. The rest of the proof is the 
same as above. 

It is obvious that $\bigcup_{i=1}^{\mu} \bigl( N_0(c, b^{(i)}) \cup N_1(c, b^{(i)}) \bigr) \subseteq \bigcup_{a} \bigl( N_0(c, a) \cup N_1(c, a) \bigr)$. The 
last inclusion finishes the proof of the theorem. □ 

The example $x \in M$ is called essential for a concept $c \in \mathrm{HS}(M)$ if there is 
$g \in \mathrm{HS}(M)$ such that $c$ and $g$ agree on $M \setminus \{x\}$ and do not agree at the point $x$. 
From the last part of the proof of Theorem 2 it follows that $T(c)$ is exactly the set of 
essential examples for $c$. For the case of the Boolean domain this is a well-known result 
(see [?] and related papers referenced in [?]). 

As an example of Theorem 2 consider the concept $c \in \mathrm{HS}(E_k^3)$ defined by 
the threshold inequality $20x_1 + 28x_2 + 35x_3 \le 140$. Rewrite the system (5) as 
$Qa \ge 0$ where $a = (a_0, \dots, a_{n+1})$ is a column of variables and $Q$ is a matrix 
formed from the coordinates of the points of $T(c)$. Let $B$ be a matrix formed 
from the entries of the vectors $b^{(i)}$, $S = QB$, and let $I$ be an identity matrix. The 
resulting matrix is represented in Table 1. We have that $\mu = 3$, 

$$N_0(c, b^{(1)}) = \{p^{(1)}, p^{(3)}\}, \quad N_1(c, b^{(1)}) = \{q^{(1)}, q^{(2)}\} ,$$

$$N_0(c, b^{(2)}) = \{p^{(1)}, p^{(2)}\}, \quad N_1(c, b^{(2)}) = \{q^{(1)}, q^{(3)}\} ,$$

$$N_0(c, b^{(3)}) = \{p^{(1)}, p^{(2)}, p^{(3)}\}, \quad N_1(c, b^{(3)}) = \{q^{(1)}\} ,$$

where $p^{(1)} = (7, 0, 0)$, $p^{(2)} = (0, 5, 0)$, $p^{(3)} = (0, 0, 4)$, $q^{(1)} = (4, 1, 1)$, $q^{(2)} = 
(3, 3, 0)$, $q^{(3)} = (2, 0, 3)$, $b^{(1)} = (56, 8, 11, 14, 1)$, $b^{(2)} = (70, 10, 14, 17, 1)$, $b^{(3)} = 
(140, 20, 28, 35, 140)$. By Theorem 2, $T_\nu(c) = \bigcup_{i=1}^{3} N_\nu(c, b^{(i)})$ ($\nu = 0, 1$). For the 
considered example it suffices to retain only 2 members in the union. Indeed, 
$T_0(c) = N_0(c, b^{(3)}) = N_0(c, b^{(1)}) \cup N_0(c, b^{(2)})$, $T_1(c) = N_1(c, b^{(1)}) \cup N_1(c, b^{(2)})$. 
Table 1. Example of Theorem 2 
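The sets in the example above can be recomputed by direct enumeration. The sketch below rests on two assumptions that are not stated legibly in the source: the domain is taken to be $\{0, \dots, 7\}^3$ (the smallest cube containing all listed points), and only the coefficient parts $(b_1, b_2, b_3)$ of the vectors $b^{(i)}$ are used, since $M_0(c, a)$ and $M_1(c, a)$ depend on nothing else. Each set found has at most three affinely independent points, so every point in it is a vertex of its convex hull and no hull computation is needed here.

```python
from itertools import product

K = 8  # assumption: the domain is {0,...,7}^3
M = list(product(range(K), repeat=3))

def value(a, x):
    """Linear form a_1*x_1 + a_2*x_2 + a_3*x_3."""
    return sum(ai * xi for ai, xi in zip(a, x))

# The concept c from the example and its complement.
c = [x for x in M if value((20, 28, 35), x) <= 140]
not_c = [x for x in M if value((20, 28, 35), x) > 140]

def M0(a):
    """Points of M attaining the maximum of the form a over c."""
    best = max(value(a, x) for x in c)
    return {x for x in M if value(a, x) == best}

def M1(a):
    """Points of M attaining the minimum of the form a over M \\ c."""
    best = min(value(a, x) for x in not_c)
    return {x for x in M if value(a, x) == best}

# Coefficient parts of b^(1), b^(2), b^(3) as given in the text.
b = {1: (8, 11, 14), 2: (10, 14, 17), 3: (20, 28, 35)}

for i in (1, 2, 3):
    print(i, sorted(M0(b[i])), sorted(M1(b[i])))
```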
5 Bounds for the Teaching Dimension of Half-Spaces 

Denote by N the set of vertices of the polytope Conv M. 

Lemma 5. If $c = M$ or $c = \emptyset$ then it holds that $T(c) = N$. 

Proof. Assume that for $c = M$ there is a point $x \in N \setminus T(c)$. Consider the 
concept $g = M \setminus \{x\}$. Since $x \in N$, it is clear that $g \in \mathrm{HS}(M)$; as $g$ and $c$ differ 
only at $x$, this gives $x \in T(c)$, a contradiction. We have proved that $N \subseteq T(c)$. The 
opposite inclusion follows from Corollary 3. For $c = \emptyset$ the lemma can be proved by analogy. □ 

From Lemma 5 it follows that $\mathrm{TD}(\mathrm{HS}_k^n) \ge 2^n$. Indeed, assume that $c = E_k^n$. 
By Lemma 5 we have that $\mathrm{TD}(c) = 2^n$, hence $\mathrm{TD}(\mathrm{HS}_k^n) \ge 2^n$. Thus, no 
algorithm polynomial in $n$ for learning half-spaces over $E_k^n$ from membership 
queries exists. This was originally proved in [?]. 

Let $P$ be a polytope in $\mathbb{R}^n$ that can be described by a system of $l$ 
linear inequalities with integer coefficients whose absolute values do not exceed 
$\gamma$. Denote by $\mathcal{P}(n, l, \gamma)$ the class of all such polytopes. For the class $\mathrm{HS}(M)$ with 
$M = P \cap \mathbb{Z}^n$ and $P \in \mathcal{P}(n, l, \gamma)$ we have 

Theorem 3. For every natural $n \ge 2$ and $l > n$ there is $\gamma_0$ such that for every 
$\gamma > \gamma_0$ there exists a polytope $P \in \mathcal{P}(n, l, \gamma)$ such that 

$$\mathrm{MEMB}(\mathrm{HS}(M)) \ge \mathrm{TD}(\mathrm{HS}(M)) \ge D_n\, l^{\lfloor n/2 \rfloor} \log^{n-1} \gamma$$

where $M = P \cap \mathbb{Z}^n$ and $D_n$ is some positive quantity depending only on $n$. 

Proof. It was proved in [?] (cf. [?]) that for any fixed $n \ge 2$ and $l > n$, for any 
sufficiently large $\gamma$ there exists a polytope $P \in \mathcal{P}(n, l, \gamma)$ such that the number 
of vertices of $\mathrm{Conv}(P \cap \mathbb{Z}^n)$ is not less than $D_n\, l^{\lfloor n/2 \rfloor} \log^{n-1} \gamma$. The assertion 
to be proved now follows from Lemma 5. □ 
Return to the class $\mathrm{HS}_k^n$. Denote by $N(a_0, a_1, \dots, a_n)$ the set of all vertices 
of the convex hull of the solutions of the following system: 

$$\begin{cases}
\sum_{j=1}^{n} a_j x_j = a_0 ; \\
x_j \ge 0;\ x_j \in \mathbb{Z} \quad (j = 1, \dots, n) .
\end{cases}$$

In [18] S. I. Veselov got a lower bound for the mean quantity of $|N(a_0, a_1, \dots, a_n)|$ 
(see Sect. 3.5 of [15]). This leads to 

Lemma 6. For every $n \ge 2$, $k \ge 2$ there are positive numbers $a_0, a_1, \dots, a_n$ 
such that $a_i \le k - 1$ ($i = 0, 1, \dots, n$) and 

$$|N(a_0, a_1, \dots, a_n)| \ge C_n \log^{n-2} k$$

where $C_n$ is some positive quantity depending only on $n$. □ 
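For small coefficients the set $N(a_0, a_1, \dots, a_n)$ can be computed explicitly. The sketch below is only an illustration for $n = 3$ and is not taken from the paper: it enumerates the non-negative integer solutions, projects them onto the first two coordinates (an affine bijection on the solution plane, because $a_3 > 0$), and finds the strict hull vertices with the standard monotone-chain method. All function names are hypothetical.

```python
from itertools import product

def knapsack_solutions(a0, a):
    """All non-negative integer x with a[0]*x[0] + ... + a[n-1]*x[n-1] == a0."""
    ranges = [range(a0 // aj + 1) for aj in a]
    return [x for x in product(*ranges)
            if sum(aj * xj for aj, xj in zip(a, x)) == a0]

def hull_vertices_2d(points):
    """Strict vertices of the convex hull of 2-D points (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, p, q):
        return (p[0] - o[0]) * (q[1] - o[1]) - (p[1] - o[1]) * (q[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            # Pop while the last turn is not strictly counter-clockwise,
            # which also discards collinear points.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    lower, upper = half(pts), half(reversed(pts))
    return sorted(set(lower[:-1] + upper[:-1]))

def N(a0, a):
    """Vertices of the convex hull of the solution set, for n = 3 only."""
    assert len(a) == 3 and all(aj > 0 for aj in a)
    lift = {(x[0], x[1]): x for x in knapsack_solutions(a0, a)}
    return sorted(lift[p] for p in hull_vertices_2d(list(lift)))

print(N(140, (20, 28, 35)))
print(N(6, (1, 2, 3)))
```

The first call uses the coefficients of the worked example in Sect. 4; the second has seven solutions, of which only three are vertices.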



Theorem 4. For every $n \ge 2$ and $k \ge 2$ 

$$C_n \log^{n-2} k \le \mathrm{TD}(\mathrm{HS}_k^n) \le C'_n \log^{n-2} k$$

where $C_n$ and $C'_n$ are some quantities depending only on $n$. 

Proof. The lower bound was announced (without a proof) in [?]. To obtain it 
we construct a concept $c$ in the following manner. Consider $a_0, a_1, \dots, a_n$ in the 
assertion of Lemma 6 as the coefficients of a threshold inequality of $c$. Since 
$1 \le a_i \le k - 1$, we have that $N(a_0, \dots, a_n) \subseteq E_k^n$. From Theorem 2 it follows 
that $T(c) \supseteq N(a_0, \dots, a_n)$, hence $\mathrm{TD}(c, \mathrm{HS}_k^n) \ge C_n \log^{n-2} k$ and, consequently, 
$\mathrm{TD}(\mathrm{HS}_k^n) \ge C_n \log^{n-2} k$. 

The upper bound was proved by T. Hegedüs [7] on the basis of [?]. It is 
clear that for $T_\nu = N_\nu(c)$ the system (3) is equivalent to the system (2), hence 
$T(c) \subseteq N_0(c) \cup N_1(c)$; it is known [?] that $|N_0(c)| + |N_1(c)| \le C'_n \log^{n-2} k$ where 
$C'_n$ is some quantity depending only on $n$. Thus for any concept $c \in \mathrm{HS}_k^n$ the 
inequality $\mathrm{TD}(c) \le C'_n \log^{n-2} k$ holds. □ 

The lower bound in Theorem 4 gives us that $\mathrm{MEMB}(\mathrm{HS}_k^n) \ge C_n \log^{n-2} k$. 

6 Related Results and Open Problems 

In proving the lower bound for the teaching dimension of half-spaces over $E_k^n$ we 
used the fact that the quantity $\mu$ in Theorem 2 is at least 1. An open problem 
remains: it would be helpful to estimate the quantity $\mu$ from above (we remark 
that for $n \ge 3$ there are examples with $\mu = 2, 3$). In this way one could apparently 
decrease the upper bound on $\mathrm{TD}(\mathrm{HS}_k^n)$. For instance, it is known from [?] that 
$\mathrm{TD}(\mathrm{HS}_k^2) = 4$. This result is of considerable interest because (as was shown in 
[?]) $\mathrm{MEMB}(\mathrm{HS}_k^2) = \Theta(\log k)$. 



70 



V.N. Shevchenko and N.Yu. Zolotykh 



References 

1. Angluin, D.: Queries and concept learning. Machine Learning (2) (1988) 319-342 

2. Anthony, M., Brightwell, G., Shawe-Taylor, J.: On specifying Boolean functions by 
labelled examples. Discrete Applied Mathematics 61 (1) (1995) 1-25 

3. Bárány, I., Howe, R., Lovász, L.: On integer points in polyhedra: a lower bound. 
Combinatorica (12) (1992) 135-142 

4. Bultman, W. J., Maass, W.: Fast identification of geometric objects with 
membership queries. Information and Computation 118 (1) (1995) 48-64 

5. Chernikov, S. N.: Linear Inequalities. “Nauka” Moscow (1968). German transl.: 
VEB Deutscher Verlag Wiss. Berlin (1971) 

6. Chirkov, A. Yu.: On lower bound of the number of vertices of a convex hull of integer 
and partially integer points of a polyhedron. Proceedings of the First International 
Conference “Mathematical Algorithms”. NNSU Publishers Nizhny Novgorod 
(1995) 128-134 (Russian) 

7. Hegedüs, T.: Geometrical concept learning and convex polytopes. Proceedings of 
the 7th Annual ACM Conference on Computational Learning Theory (COLT’94). 
ACM Press New York (1994) 228-236 

8. Hegedüs, T.: Generalized teaching dimensions and the query complexity of 
learning. Proceedings of the 8th Annual ACM Conference on Computational Learning 
Theory (COLT’95). ACM Press New York (1995) 108-117 

9. Korobkov, V. K.: On monotone functions of logic algebra. Cybernetics Problems. 
“Nauka” Moscow 13 (1965) 5-28 (Russian) 

10. Maass, W., Turán, Gy.: Lower bound methods and separation results for on-line 
learning models. Machine Learning (9) (1992) 107-145 

11. Moshkov, M. Yu.: Conditional tests. Cybernetics Problems. “Nauka” Moscow 40 
(1983) 131-170 (Russian) 

12. Schrijver, A.: Theory of Linear and Integer Programming. Wiley-Interscience New 
York (1986) 

13. Shevchenko, V. N.: On some functions of many-valued logic connected with integer 
programming. Methods of Discrete Analysis in the Theory of Graphs and Circuits. 
Novosibirsk 42 (1985) 99-102 (Russian) 

14. Shevchenko, V. N.: Deciphering of a threshold function of many-valued logic. 
Combinatorial-Algebraic Methods in Applied Mathematics. Gorky (1987) 155-163 
(Russian) 

15. Shevchenko, V. N.: Qualitative Topics in Integer Linear Programming. “Fizmatlit” 
Moscow (1995). English transl.: AMS Providence Rhode Island (1997) 

16. Shevchenko, V. N., Zolotykh, N. Yu.: Decoding of threshold functions defined 
on the integer points of a polytope. Pattern Recognition and Image Analysis. 
MAIK/Interperiodica Publishing Moscow 7 (2) (1997) 235-240 

17. Shevchenko, V. N., Zolotykh, N. Yu.: On complexity of deciphering threshold 
functions of k-valued logic. Russian Math. Dokl. (Doklady Rossiiskoi Akademii Nauk) 
(to appear) 

18. Veselov, S. I.: A lower bound for the mean number of irreducible and extreme 
points in two discrete programming problems. Manuscript No. 619-84, deposited 
at VINITI Moscow (1984) (Russian) 

19. Zolotykh, N. Yu.: An algorithm of deciphering a threshold function of k-valued 
logic in the plane with the number of calls to the oracle O(log k). Proceedings of 
the First International Conference “Mathematical Algorithms”. NNSU Publishers 
Nizhny Novgorod (1995) 21-26 (Russian) 







20. Zolotykh, N. Yu., Shevchenko, V. N.: On complexity of deciphering threshold 
functions. Discrete Analysis and Operations Research. Novosibirsk 2 (1) (1995) 72-73 
(Russian) 

21. Zolotykh, N. Yu., Shevchenko, V. N.: Deciphering threshold functions of k-valued 
logic. Discrete Analysis and Operations Research. Novosibirsk 2 (3) (1995) 18-23. 
English transl.: Korshunov, A. D. (ed.): Operations Research and Discrete Analysis. 
Kluwer Ac. Publ. Netherlands (1997) 321-326 




Cryptographic Limitations on Parallelizing 
Membership and Equivalence Queries 
with Applications to Random Self-Reductions 



Marc Fischlin 

Fachbereich Mathematik (AG 7.2)/Informatik 
Johann Wolfgang Goethe-Universität Frankfurt am Main 
Postfach 111932 

60054 Frankfurt/Main, Germany 
marc@mi.informatik.uni-frankfurt.de 
http://www.mi.informatik.uni-frankfurt.de/ 



Abstract. We assume wlog. that every learning algorithm with membership 
and equivalence queries proceeds in rounds. In each round it puts in 
parallel a polynomial number of queries and after receiving the answers, 
it performs internal computations before starting the next round. The 
query depth is defined by the number of rounds. In this paper we show 
that, assuming the existence of cryptographic one-way functions, for any 
fixed polynomial $d(n)$ there exists a concept class that is efficiently and 
exactly learnable with membership queries in query depth $d(n) + 1$, but 
cannot be weakly predicted with membership and equivalence queries in 
depth $d(n)$. Hence, concerning the query depth, efficient learning 
algorithms for this concept class cannot be parallelized at all. We also discuss 
some applications to random self-reductions and coherent sets. 



1 Introduction 

A fundamental problem in computer science is the question of whether and how 
sequential algorithms can be parallelized. This is an intrinsic problem in 
computational learning theory, too. Parallelizing PAC algorithms [?] is only a matter of 
parallelizing the internal computations, because a sufficient number of random 
examples can be generated in a single concurrent step [?]. For learning 
algorithms with membership and equivalence queries [?] this problem is closely 
related to the “grade of adaptiveness” of the queries. A quantitative 
formalization is via the query depth of a learning algorithm: We assume wlog. that the 
learning algorithm proceeds in rounds. In each round it is allowed to put in 
parallel a polynomial number of membership and equivalence queries. After 
receiving the answers, it performs some internal computation and then starts the 
next round. The query depth (as a function of some complexity parameter $n$) is 
the maximal number of rounds, where the maximum is taken over all target 
concepts of complexity $n$. Bshouty and Cleve [?] prove that exact learning with 
membership and equivalence queries, e.g. of read-once Boolean functions and 
monotone DNF formulas in $n$ variables, requires a query depth of $\Omega(n/\log n)$. 
Balcázar, Díaz, Gavaldà and Watanabe [?] show that DFA with $n$ states can be 
learned exactly with membership and equivalence queries in depth $O(n/\log n)$. 
Moreover, they prove that this bound is optimal, as there cannot exist a learning 
algorithm that learns DFA exactly in query depth $o(n/\log n)$. These negative 
results are not tight in the sense that it remains open if there is a concept 
class where allowing one additional level of query depth helps. Also, these lower 
bounds are sublinear and hold for exact learning exclusively. In this paper, we 
show that for any given polynomial $d(n)$ there is a concept class such that the 
class cannot be weakly predicted with membership and equivalence queries in 
query depth $d(n)$, though there exists a polynomial-time algorithm that learns 
every target concept in query depth $d(n) + 1$ exactly with membership queries. 
We emphasize that, adding a single “level of adaptiveness”, we can learn this 
class exactly, while any learning algorithm with depth $d(n)$ miserably fails, i.e., 
cannot satisfy a potentially weaker requirement than PAC-learnability (with 
queries). While our impossibility result as well as the lower bound of [?] only 
hold for polynomial-time algorithms, the result of Bshouty and Cleve is also 
valid for computationally unbounded parallel learners, as long as the number 
of queries is polynomially bounded. 

The intractability of our concept class is based on a cryptographic assumption, 
namely the existence of one-way functions. These are functions that are easy to 
evaluate but hard to invert on a random value. Besides complexity-based 
impossibility results (see for example [23]), several negative results for learning 
algorithms have been based on cryptographic primitives. Angluin and Kharitonov [?] 
use one-way functions to show that membership queries do not add any power 
to PAC-algorithms when learning DNF formulas. Similarly, Kearns and Valiant 
[?] and Kharitonov [?] show that polynomial-size Boolean formulas are not 
efficiently PAC-learnable with membership queries if one-way functions exist. 
Rivest and Yin present a concept class based on the existence of one-way 
functions where self-directed learning is inferior to teacher-directed learning. 
We exploit their idea to define our concept class using so-called collections of 
pseudorandom functions: Informally, a collection of pseudorandom functions is a 
sequence $(F_n)_{n \in \mathbb{N}}$ of function sets $F_n \subseteq \{g : \{0,1\}^n \to \{0,1\}^n\}$. Each set $F_n$ 
contains $2^n$ functions, where every function in $F_n$ is identified by a key $k \in \{0,1\}^n$. 
While most of the functions in the set of all $2^{n 2^n}$ functions $g : \{0,1\}^n \to \{0,1\}^n$ 
must have exponential description size, $F_n$ only contains a very small fraction 
of these functions and therefore supports short identifiers. Yet, $F_n$ preserves the 
randomness property, that is, if we uniformly choose a key $k \in \{0,1\}^n$ then the 
function described by this key “looks” like a uniformly chosen function from the 
set $\{g : \{0,1\}^n \to \{0,1\}^n\}$. It is well-known that collections of pseudorandom 
functions exist if and only if one-way functions exist. Given any collection of 
pseudorandom functions we define the concepts of complexity $n$ by the keys of 
$F_n$ and such that a particular query sequence of depth $d(n) + 1$ yields the key of 
the function, i.e. the name of the target concept. Hence, we can easily learn the 
target concept in depth $d(n) + 1$. Conversely, there cannot exist any probabilistic 
polynomial-time algorithm that, after experimenting using membership and 
equivalence queries in query depth $d(n)$, classifies a random example correctly 
with probability at least $1/2 + 1/p(n)$ for an arbitrary positive polynomial $p(n)$ 
and all but finitely many $n \in \mathbb{N}$. Otherwise we derive a contradiction to the 
pseudorandomness of the underlying collection. 
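To make the shape of such a collection concrete, the following sketch builds a keyed family $\{0,1\}^n \to \{0,1\}^n$ from HMAC-SHA256. This is an assumption made purely for illustration: HMAC is a heuristic stand-in, not the construction from one-way functions that the text refers to, and the toy parameter $n = 8$ offers no security. The function name `prf` and the two sample keys are hypothetical.

```python
import hashlib
import hmac

def prf(key: bytes, x: int, n: int) -> int:
    """Keyed map from n-bit inputs to n-bit outputs (needs n <= 256).

    Heuristic stand-in: the top n bits of HMAC-SHA256(key, x).
    """
    message = x.to_bytes((n + 7) // 8, "big")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") >> (256 - n)

n = 8
key_a, key_b = b"\x01", b"\x02"  # two n-bit keys, i.e. two members of F_n

# A key is a short name for a whole function table with 2^n entries.
table_a = [prf(key_a, x, n) for x in range(2 ** n)]
table_b = [prf(key_b, x, n) for x in range(2 ** n)]

print(table_a[:8])
print(table_b[:8])
```

The point of the sketch is only the interface: $2^n$ keys select $2^n$ functions out of $2^{n 2^n}$ possible ones, and each selected function is cheap to evaluate given its key.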

We apply our result on the non-parallelizability of the queries to random 
self-reductions [?]. Informally, a language $L$ is self-reducible [?] if, for any $x$, we can 
compute the characteristic function $\chi_L$ of $L$ at $x$ from values $\chi_L(y_1), \dots, \chi_L(y_m)$, 
where $|y_1|, \dots, |y_m| < |x|$. Put differently, $L$ is self-reducible if membership can 
be decided by querying the oracle $\chi_L$ for smaller elements. A classic example of a 
self-reducible language is SAT. An interesting special case of self-reductions are 
random self-reductions, where each query $y_i$ is a random value distributed 
independently of $x$ (but not necessarily independently of the other queries). Unlike 
self-reductions, random self-reductions do not require that the queries are smaller 
elements. The query depth of a random self-reduction is defined analogously to 
the query depth of a learning algorithm. Feigenbaum et al. [?] show that 
adaptive (more specifically, query depth $|x|$) random self-reductions are more 
powerful than nonadaptive ones. Combining our result with [?] we establish the 
following hierarchy: Let $\beta(n)$ be an unbounded, nondecreasing function 
such that $n^{\beta(n)}$ is time-constructible (e.g., $\beta(n) = \log^* n$) and let $d(n)$ be a fixed 
polynomial. If one-way functions exist, there is a language in $\mathrm{DSPACE}(n^{\beta(n)})$ 
such that there is a random self-reduction with query depth $d(n) + 1$, while 
every length-preserving random self-reduction of depth $d(n)$ fails. We show that 
similar results can be derived for coherent sets. 
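The self-reducibility of SAT mentioned above can be sketched as follows: a formula is satisfiable iff one of the two formulas obtained by fixing its first variable is, so one round of two oracle queries on formulas with fewer variables suffices. The clause encoding (lists of non-zero integers, negative meaning negated) and all function names are assumptions for this illustration, and the brute-force oracle merely stands in for $\chi_{\mathrm{SAT}}$; here "smaller" is measured by the number of variables rather than by encoding length.

```python
from itertools import product

def variables_of(clauses):
    """Sorted list of variable indices occurring in the formula."""
    return sorted({abs(lit) for clause in clauses for lit in clause})

def brute_force_sat(clauses):
    """Stand-in oracle for chi_SAT: tries every assignment."""
    variables = variables_of(clauses)
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

def substitute(clauses, var, value):
    """Fix var to value: drop satisfied clauses, delete falsified literals."""
    true_literal = var if value else -var
    return [[lit for lit in clause if lit != -true_literal]
            for clause in clauses if true_literal not in clause]

def sat_by_self_reduction(clauses, oracle):
    """Decide SAT using oracle queries only on formulas with fewer variables."""
    variables = variables_of(clauses)
    if not variables:
        # No variables left: satisfiable iff there is no empty clause.
        return all(len(clause) > 0 for clause in clauses)
    v = variables[0]
    return (oracle(substitute(clauses, v, False))
            or oracle(substitute(clauses, v, True)))

phi = [[1, 2], [-1, 2], [-2, 3]]   # (x1 or x2) and (not x1 or x2) and (not x2 or x3)
psi = [[1], [-1]]                  # x1 and not x1
print(sat_by_self_reduction(phi, brute_force_sat))
print(sat_by_self_reduction(psi, brute_force_sat))
```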

The paper is organized as follows. In Section 2 we introduce notations and 
definitions of learning theory, cryptography, random self-reductions and 
coherence. In Section 3 we define our concept class and prove the positive 
and negative results about learnability. Finally, in Section 4 we apply this result 
to random self-reductions as well as coherent sets. 

2 Preliminaries 

We introduce some basic notations. For a finite set S let y S denote a 
uniformly chosen element $y$ from $S$. We write $\pi_j(y) \in \{0, 1\}$ for the projection 
onto the $j$-th bit of $y \in \{0,1\}^n$, where $n$ is understood from the context and $j \in 
\{1, \dots, n\}$. For notational convenience, we freely switch between natural 
numbers and their binary representations. 

2.1 Computational Learning Theory 

We briefly recall notations and definitions of learning theory. Let $X = (X_n)_{n \in \mathbb{N}}$ 
denote the domain, where $X_n \subseteq \{0,1\}^{p(n)}$ for some polynomial $p(n)$. For $k \in 
\{0,1\}^n$, a concept $C_k$ is a subset of $X_n$. We call $k$ the name of $C_k$. Let $\mathcal{C}_n = 
\{C_k \mid k \in \{0,1\}^n\}$ and define the concept class by $\mathcal{C} = (\mathcal{C}_n)_{n \in \mathbb{N}}$. We usually view 
$C_k$ as a Boolean function; that is, $C_k(x) = 1$ if $x \in C_k$ and $C_k(x) = 0$ otherwise. 
Let $\mathcal{D} = (\mathcal{D}_n)_{n \in \mathbb{N}}$ be a sequence of distributions $\mathcal{D}_n$ on $X_n$. We say that $\mathcal{D}$ is 
efficiently sampleable if there is a probabilistic polynomial-time algorithm such 
that for input $1^n$ the output of the algorithm is identically distributed to $\mathcal{D}_n$. 

Following Kharitonov [?] we define a prediction with membership and 
equivalence queries algorithm (pwme-algorithm). Let $\mathcal{C}$ be a concept class and $\mathcal{D}$ be 
an efficiently sampleable distribution. The error parameter function $\varepsilon : \mathbb{N} \to \mathbb{R}^+$ 
determines the accuracy of the learning algorithm. A pwme-algorithm $L$ is a 
probabilistic algorithm that gets inputs $n$ and $\varepsilon(n)$ and, after a target concept 
$C_k \in \mathcal{C}$ has been chosen, may make, in addition to internal computations, 

— membership queries, i.e., query the oracle $C_k$ for arbitrary $x \in X_n$ 

— equivalence queries, i.e., give $k' \in \{0,1\}^n$ to the oracle and receive the answer 
“yes” if $C_k = C_{k'}$, or otherwise a counterexample $x \in X_n$ with $C_k(x) \ne C_{k'}(x)$ 

— exactly one challenge query, where an example $z \in X_n$ is randomly generated 
according to the distribution $\mathcal{D}_n$ and returned to $L$. $L$ is then supposed to 
make a guess for $C_k(z)$ 

We say that $L$ successfully predicts $\mathcal{C}$ with respect to $\mathcal{D}$ and $\varepsilon$ iff, for all $n \in \mathbb{N}$ 
and $C_k \in \mathcal{C}_n$, the probability that $L$’s guess is correct, i.e., equals $C_k(z)$, is at least 
$1 - \varepsilon(n)$. We call $\mathcal{C}$ efficiently predictable with respect to $\mathcal{D}$ and $\varepsilon$ iff there is a 
pwme-algorithm $L$ that successfully predicts $\mathcal{C}$ with respect to $\mathcal{D}$ and $\varepsilon$ and runs in 
polynomial time in $n$ and $1/\varepsilon(n)$. We say that $\mathcal{C}$ is weakly predictable with respect 
to $\mathcal{D}$ iff it is efficiently predictable with respect to $\mathcal{D}$ and $\varepsilon(n) = 1/2 - 1/p(n)$ 
for some polynomial $p : \mathbb{N} \to \mathbb{R}^+$ and all but finitely many $n \in \mathbb{N}$. We call a 
pwme-algorithm $L$ a pwm-algorithm if $L$ is not allowed equivalence queries. 

Note that $C$ and $\mathcal{D}$ are fixed and therefore known by $L$. Note also that $L$ does
not receive randomly generated examples (as PAC algorithms do). This is no
restriction, because we only consider efficiently sampleable distributions: $L$ can
generate an example by itself and then put a membership query for this example.
Moreover, we remark that unpredictability implies impossibility of PAC-learnability
with queries (see the discussion in [?]).

Next, we define the query depth of a pwme-algorithm. We assume w.l.o.g. that
any pwme-algorithm $L$ proceeds in rounds. At the beginning of each round, $L$
puts in parallel membership and equivalence queries and receives the answers.
Then it performs internal computations and starts the next round. After finishing
the last round, it is allowed additional computations and finally gives its output.
The pwme-algorithm has query depth $d(n)$ if it takes at most $d(n)$ rounds for
inputs $n$, $\epsilon(n)$ and all target concepts of complexity $n$. A concept class $C$ is
weakly predictable in query depth $d(n)$ with respect to $\mathcal{D}$ if it is weakly
predictable by a pwme-algorithm with query depth $d(n)$.

As for the positive result on the learnability of our concept class, we say
that a concept class $C$ is exactly learnable in polynomial time with membership
queries iff there exists a polynomial-time algorithm $L$ such that for all $n \in \mathbb{N}$
and $c_k \in C_n$, algorithm $L$ with oracle access to $c_k$ outputs a name
$k' \in \{0,1\}^n$ such that $c_k = c_{k'}$. The query depth of such an algorithm is
defined analogously to the depth of a pwme-algorithm. If this depth is bounded by
$d(n)$, we call $C$ exactly learnable with membership queries in query depth $d(n)$.

2.2 Cryptography

In this section we introduce the cryptographic background. A function
$\delta : \mathbb{N} \to \mathbb{R}^+$ is called negligible iff it vanishes faster than any polynomial
fraction, i.e., iff for any polynomial $p : \mathbb{N} \to \mathbb{R}^+$ there exists $n_0 \in \mathbb{N}$
such that $\delta(n) < 1/p(n)$ for all $n \geq n_0$. For instance, $\delta(n) = 2^{-n}$ is
negligible. For the rest of the paper, we abbreviate “there exists $n_0$ such that ...
for all $n \geq n_0$” by “for all sufficiently large $n$”. In the sequel we use the
following facts about negligible functions: Let $f(n) > 1/p_0(n)$ for some positive
polynomial $p_0$ and infinitely many $n$ and let $\delta(n)$ be a negligible function;
then $f(n) - \delta(n) > 1/2p_0(n)$ for infinitely many $n$. Additionally, it is easy to
see that $p(n) \cdot \delta(n)$ is negligible for any positive polynomial $p(n)$ if and only
if $\delta(n)$ is negligible.
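
The first fact can be checked in one line. The following is a short worked step under the stated hypotheses only (the factor 2 is absorbed because $2p_0$ is again a polynomial); it is a sketch of the standard argument, not a quotation from the paper.

```latex
% Since 2 p_0 is a polynomial and \delta is negligible:
\[
  \delta(n) < \frac{1}{2p_0(n)} \quad \text{for all sufficiently large } n .
\]
% Hence, for each of the infinitely many n with f(n) > 1/p_0(n) that are large enough:
\[
  f(n) - \delta(n) \;>\; \frac{1}{p_0(n)} - \frac{1}{2p_0(n)} \;=\; \frac{1}{2p_0(n)} .
\]
```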

A collection $F = (F_n)_{n \in \mathbb{N}}$ of functions is a sequence of functions
$F_n : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}^n$. The first argument is called the key and
usually denoted by $k \in \{0,1\}^n$. If it is fixed and $n$ is understood, we write
$F_k(\cdot)$ for the function $F_n(k, \cdot)$. For a definition of pseudorandomness we
consider the following experiment. Let $D$ be a probabilistic polynomial-time
algorithm. At the beginning, a random key $k \in_R \{0,1\}^n$ is chosen and kept secret
from $D$. $D$ is given $1^n$ ($n$ in unary) as input and is allowed to adaptively query
the oracle $F_k(\cdot)$ for values of its choice. Then $D$ outputs a challenge
$y \in \{0,1\}^n$ such that $y$ has not been queried previously and $D$ is disconnected
from the oracle. A bit $b \in_R \{0,1\}$ is chosen at random as well as a random string
$r \in_R \{0,1\}^n$ and $D$ is given $(Q_0, Q_1)$ where $Q_b = F_k(y)$ and $Q_{1-b} = r$.
That is, $D$ receives the value of $F_k$ at $y$ and a random string in random order.
Finally, algorithm $D$ is supposed to output a guess $g \in \{0,1\}$ for $b$. The
distinguishing advantage of $D$ is the probability (over the choice of $k$ and the
coin tosses of $D$) that $D$'s guess is correct minus the pure guessing probability:
$\mathrm{Adv}_D = |\mathrm{Prob}[b = g] - 1/2|$. Note that $\mathrm{Adv}_D$ is a function of
$n \in \mathbb{N}$, the input of $D$. Roughly speaking, $F$ is pseudorandom if no
distinguisher $D$ can predict $b$ essentially better than with probability $1/2$ for
sufficiently large $n$.
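
The experiment can be made concrete with a small harness. The following Python sketch is illustrative only: truncated HMAC-SHA256 is used as a stand-in for $F_k$ (an assumption, not the paper's construction), and the names `prf`, `run_experiment`, `CoinFlipDistinguisher` and `estimate_advantage` are hypothetical.

```python
import hashlib
import hmac
import secrets

N_BYTES = 16  # illustrative block length, i.e. n = 128 bits (assumption)


def prf(key: bytes, x: bytes) -> bytes:
    """Stand-in for F_k(x): HMAC-SHA256 truncated to N_BYTES (assumption)."""
    return hmac.new(key, x, hashlib.sha256).digest()[:N_BYTES]


def run_experiment(distinguisher) -> bool:
    """Run the experiment once and report whether D's guess g equals b."""
    key = secrets.token_bytes(N_BYTES)  # secret key k, never shown to D
    queried = set()

    def oracle(x: bytes) -> bytes:
        queried.add(x)
        return prf(key, x)

    y = distinguisher.choose_challenge(oracle)  # adaptive query phase
    if y in queried:
        raise ValueError("the challenge y must not have been queried")

    b = secrets.randbelow(2)
    r = secrets.token_bytes(N_BYTES)
    pair = [r, r]
    pair[b] = prf(key, y)  # Q_b = F_k(y), Q_{1-b} = r
    return distinguisher.guess(pair[0], pair[1]) == b


class CoinFlipDistinguisher:
    """Baseline D that ignores what it sees; its advantage should be near 0."""

    def choose_challenge(self, oracle) -> bytes:
        oracle(b"\x00" * N_BYTES)  # one sample query
        return b"\x01" * N_BYTES   # a point that was not queried

    def guess(self, q0: bytes, q1: bytes) -> int:
        return secrets.randbelow(2)


def estimate_advantage(distinguisher, trials: int = 2000) -> float:
    """Empirical |Prob[b = g] - 1/2| over independent runs."""
    wins = sum(run_experiment(distinguisher) for _ in range(trials))
    return abs(wins / trials - 0.5)
```

Calling `estimate_advantage(CoinFlipDistinguisher())` gives a Monte Carlo estimate of the advantage; it says nothing about asymptotic negligibility, which is a statement about all sufficiently large $n$.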

Definition 1 (Collection of Pseudorandom Functions). A collection
$F = (F_n)_{n \in \mathbb{N}}$ of functions $F_n : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}^n$ is called
a collection of pseudorandom functions iff

— there exists a polynomial-time algorithm $\mathcal{F}$ such that $\mathcal{F}(k,x) = F_n(k,x)$ for
any $k, x \in \{0,1\}^n$ and all $n \in \mathbb{N}$

— the distinguishing advantage $\mathrm{Adv}_D(n)$ of any probabilistic polynomial-time
algorithm $D$ is negligible

We remark that the second property is different from, yet equivalent to [?], the
definition usually used in the literature. Also note that the first property means
that $(F_n)_{n \in \mathbb{N}}$ is computable in polynomial time in $n$. It is well known that
collections of pseudorandom functions exist if and only if one-way functions exist
[?]. One-way functions are believed to be the weakest assumption for non-trivial
cryptography [?].

In the sequel we will use the following fact about pseudorandom functions.
Consider the variation of the experiment above, where $D$, after querying the
oracle $F_k(\cdot)$, outputs a pair $(y, z)$ such that $y$ has not been passed to the
oracle yet. The prediction probability of $D$ (as a function of $n$) is the
probability that $F_k(y) = z$. That is, the prediction probability denotes the
probability that $D$ can predict the function value at $y$ without having seen it.
It is not hard to show that for a collection of pseudorandom functions the
prediction probability of any probabilistic polynomial-time algorithm $D$ is
negligible. This comes from the fact that if one can predict the value then it is
also easy to distinguish it from a random string.

2.3 Randomly Self-Reducible and Coherent Sets 

In this section we introduce the notions of random-self-reductions [3] and coherent
sets [?]. The definition of the query depth of the corresponding primitive is a
straightforward extension of the definition for learning algorithms. The following
is taken from [?]:

Definition 2 (Random-Self-Reduction). A function $f : \{0,1\}^* \to \{0,1\}^*$
is called nonadaptively $k(n)$-random-self-reducible if there exist polynomial-time
algorithms $\phi$, $\sigma$ and a polynomial $p(n)$ such that for all $x$ we have

$$f(x) = \phi\bigl(x, r, f(\sigma(1,x,r)), \ldots, f(\sigma(k(|x|),x,r))\bigr)$$

with probability at least $2/3$ over the choice of $r \in_R \{0,1\}^{p(|x|)}$. Additionally,
for all $x, y \in \{0,1\}^n$ the random variables $\sigma(i,x)$ and $\sigma(i,y)$ are
identically distributed.

From the definition it immediately follows that a single value $\sigma(i,x,r)$ does
not yield any information about $x$. Yet, $\sigma(i,x)$ and $\sigma(j,x)$ are dependent in
general and may therefore reveal $x$. More generally, we consider adaptive random
self-reductions where $\sigma(i,x,r)$ may also depend on the previous answers
$f(\sigma(1,x,r)), \ldots, f(\sigma(i-1,x,r))$ for $i = 1, \ldots, k(|x|)$. It is easy to see
that the error probability $1/3$ can be decreased to $2^{-q(n)}$ for any polynomial
$q(n)$ by standard techniques for both adaptive and nonadaptive reductions. In
particular, lowering the error probability by majority decision preserves the
query depth. We remark that the notion of the query depth of random
self-reductions has been mentioned implicitly in [?] though, to the best of our
knowledge, it has not been investigated further, except for the extreme cases of
adaptive and nonadaptive reductions.

A random self-reduction is oblivious if the queries $\sigma(1,x,r), \ldots, \sigma(k(|x|),x,r)$
do not depend on $x$, i.e., $\sigma(i,x,r) = \sigma(i,r)$ for $i = 1, \ldots, k(|x|)$. It is called
deterministic if the queries do not depend on $r$. In contrast to “ordinary”
self-reductions we do not restrict the queries $\sigma(i,x,r)$ to be smaller than the
input, but allow queries with arbitrary length. We say that a random self-reduction
is length-preserving if $|\sigma(i,x,r)| = |x|$ for all $i, r$. It is called
length-monotone if $|\sigma(i,x,r)| \leq |x|$. We say that a set $\mathcal{L}$ is randomly
self-reducible if its characteristic function $\chi_{\mathcal{L}}$ is.

Closely related to random self-reducible sets are so-called coherent sets.
Informally, these are sets $\mathcal{L}$ where membership of any input $x$ can be efficiently
decided with help of the oracle $\chi_{\mathcal{L} \setminus \{x\}}$. More formally, let
$f : \{0,1\}^* \to \{0,1\}$ be a Boolean function. An examiner for $f$ is a probabilistic
polynomial-time oracle Turing machine $E$ that, on input $x$, never queries the oracle
$f$ for $x$. Let $E^f(x)$ denote the random variable that describes the output.

Definition 3 (Coherent Set). A set $\mathcal{L}$ is called coherent if there exists an
examiner $E$ such that $E^{\chi_{\mathcal{L}}}(x) = \chi_{\mathcal{L}}(x)$ with probability at least $2/3$.

Again, the error probability can be decreased to $2^{-q(n)}$ while preserving the
query depth. We say that $\mathcal{L}$ is deterministic coherent if $E$ is (deterministic)
polynomial-time. $\mathcal{L}$ is called weakly coherent if $E$ is a polynomial-size circuit
family. In this case, we say that $E$ is a weak examiner. If $\mathcal{L}$ is not coherent it
is called incoherent.

It is easy to see (for example [5]) that for every language $\mathcal{L}$ the set
$\mathcal{L} \oplus \mathcal{L} = \{0x \mid x \in \mathcal{L}\} \cup \{1x \mid x \in \mathcal{L}\}$ is coherent. Additionally,
Beigel and Feigenbaum [?] show that every randomly self-reducible set is also weakly
coherent. The converse is unlikely to hold, as every NP-complete set is coherent
but, unless the polynomial hierarchy collapses at the third level, is not randomly
self-reducible in query depth $O(\log n)$. See [?] for details.
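
The coherence of $\mathcal{L} \oplus \mathcal{L}$ can be seen from a one-query examiner: on input $bx$ it asks the oracle about the word with the first bit flipped, which is never the input itself. A minimal Python sketch follows; the base language `in_L` is a hypothetical placeholder, and words are modelled as strings over `0`/`1`.

```python
def examiner(word: str, oracle) -> bool:
    """Decide whether `word` is in L (+) L without ever querying `word`."""
    if not word or word[0] not in "01":
        return False  # every element of the set starts with a marker bit
    twin = ("1" if word[0] == "0" else "0") + word[1:]
    return oracle(twin)  # a single query, and twin != word


def in_L(x: str) -> bool:
    """Hypothetical base language L: strings with an even number of 1s."""
    return x.count("1") % 2 == 0


def joined_set(word: str) -> bool:
    """Characteristic function of L (+) L = {0x | x in L} u {1x | x in L}."""
    return len(word) >= 1 and word[0] in "01" and in_L(word[1:])
```

The examiner is deterministic and has query depth 1; its correctness only uses that $0x$ and $1x$ are in the set for exactly the same $x$.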



3 Limitations on Parallelizing Queries 



First, we define our concept class based on any collection of pseudorandom
functions. Then we show that this class cannot be predicted with membership
queries in depth $d(n)$, though it can be learned exactly in depth $d(n) + 1$. Finally,
we discuss that prediction remains hard even if we add equivalence queries.

Let $F = (F_n)_{n \in \mathbb{N}}$ be a collection of pseudorandom functions and let $d(n)$ be
a fixed polynomial. For a function $F_k(\cdot) = F_n(k, \cdot)$ and $i = 0, \ldots, d(n)$ define

$$y_k^{(i)} = \begin{cases} 0^n & \text{if } i = 0 \\ F_k\bigl(y_k^{(i-1)}\bigr) & \text{else} \end{cases}$$

That is, $y_k^{(i)}$ is obtained by iterating $F_k(\cdot)$ $i$ times at $0^n$. For each
$k \in \{0,1\}^n$ alter $F_k(\cdot)$ to a function $F_k^d(\cdot)$ by setting

$$F_k^d(x) = \begin{cases} k & \text{if } x = y_k^{(d(n))} \\ F_k(x) & \text{else} \end{cases}$$

Thus, the only difference between $F_k^d$ and $F_k$ is that $F_k^d$ reveals the key if it
is evaluated at $y_k^{(d(n))} = F_k(\cdots F_k(0^n))$. Define the concept class
$C = (C_n)_{n \in \mathbb{N}}$ by $C_n = \{c_k \mid k \in \{0,1\}^n\}$, where

$$c_k = \bigl\{ (x,j) \in \{0,1\}^{n + \lceil \log n \rceil} \bigm| \pi_j\bigl(F_k^d(x)\bigr) = 1 \bigr\}$$

Recall that $\pi_j(F_k^d(x))$ is the projection of $F_k^d(x)$ onto bit $j$. The
distribution $\mathcal{D}_n$ on $\{0,1\}^{n + \lceil \log n \rceil}$ is described by picking
$x \in_R \{0,1\}^n$ and $j \in_R \{1, \ldots, n\}$ independently. Obviously, $\mathcal{D}$ is
efficiently sampleable.

Lemma 1. The concept class $C$ is exactly learnable with membership queries in
query depth $d(n) + 1$.

Proof. Let $c_k$ be the target concept. In each round $i = 1, \ldots, d(n) + 1$ query in
parallel the oracle $c_k$ for $(y_k^{(i-1)}, 1), \ldots, (y_k^{(i-1)}, n)$. Clearly, we can
reconstruct $F_k^d(y_k^{(i-1)})$ from the answers. Therefore, we finally obtain
$F_k^d(y_k^{(d(n))}) = k$. □
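
The construction of the concept class and the learner from the proof of Lemma 1 can be sketched together in a few lines. This is a toy sketch under stated assumptions: truncated HMAC-SHA256 stands in for $F_k$, $n$ is fixed to 32 bits, bit $j = 1$ is taken to be the leftmost bit, and the names `F`, `chain`, `make_concept` and `learn_exactly` are hypothetical.

```python
import hashlib
import hmac

N = 32  # toy key/block length in bits (assumption)


def F(k: int, x: int) -> int:
    """Stand-in for F_k(x) on N-bit integers (assumption: truncated HMAC)."""
    mac = hmac.new(k.to_bytes(N // 8, "big"), x.to_bytes(N // 8, "big"),
                   hashlib.sha256)
    return int.from_bytes(mac.digest()[: N // 8], "big")


def chain(k: int, d: int) -> list:
    """Return [y^(0), ..., y^(d)] with y^(0) = 0^n and y^(i) = F_k(y^(i-1))."""
    ys = [0]
    for _ in range(d):
        ys.append(F(k, ys[-1]))
    return ys


def make_concept(k: int, d: int):
    """Membership oracle c_k(x, j) = pi_j(F_k^d(x)) for j in 1..N."""
    trapdoor = chain(k, d)[-1]  # y^(d), the point where the key is revealed

    def c_k(x: int, j: int) -> int:
        value = k if x == trapdoor else F(k, x)
        return (value >> (N - j)) & 1  # bit j, counting from the left

    return c_k


def learn_exactly(c_k, d: int):
    """Recover k in d + 1 rounds of N parallel membership queries each."""
    point, rounds = 0, 0
    for _ in range(d + 1):
        # All N queries of a round are fixed before any answer arrives.
        batch = [(point, j) for j in range(1, N + 1)]
        answers = [c_k(x, j) for (x, j) in batch]
        rounds += 1
        point = int("".join(str(bit) for bit in answers), 2)
    return point, rounds
```

The learner is correct whenever $y_k^{(0)}, \ldots, y_k^{(d(n))}$ are pairwise distinct, which the appendix shows to hold for all but a negligible fraction of keys; the sketch does not handle colliding chains.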

Lemma 2. $C$ is not weakly predictable with membership queries in query depth
$d(n)$ with respect to $\mathcal{D}$.

The outline of the proof is as follows. If $C$ were weakly predictable then this would
also hold if we choose the target concept at random, namely select
$k \in_R \{0,1\}^n$ and let $c_k$ be the target concept. Since the query depth of the
learning algorithm is bounded by $d(n)$, it cannot query for $y_k^{(d(n))}$ and therefore
obtain the key $k$, unless it can guess at least one of the values
$y_k^{(1)}, \ldots, y_k^{(d(n))}$ (which, as we will see, are distinct with high probability).
But this would contradict the unpredictability of the pseudorandom function. Hence,
as the learning algorithm cannot obtain the key, predicting a random example is
almost as hard as distinguishing the value of the pseudorandom function from a
random string. The formal proof is delegated to Appendix A. We obtain:

Theorem 1. If one-way functions exist, then there is a concept class that is
not weakly predictable with membership queries in query depth $d(n)$, but can be
learned exactly with membership queries in query depth $d(n) + 1$ for any fixed
polynomial $d(n)$.

It remains to show that adding equivalence queries does not help learning in
query depth $d(n)$. The idea is similar to Angluin's well-known technique [?] of
replacing an equivalence query by a polynomial number of parallel membership
queries. In our case this is even much simpler than in general. Assume that
$L$ puts an equivalence query for $k' \in \{0,1\}^n$. Then, for a randomly chosen
$x \in_R \{0,1\}^n$, we have $F_k(x) \neq F_{k'}(x)$ with probability at least
$1 - 1/q(n) > 1/2$ for every polynomial $q$ and sufficiently large $n$. Otherwise we
could use $L$ to construct a successful predictor for pseudorandom functions,
because guessing the key is even harder than predicting a single value. Thus, with
probability at least $1/2n$ it holds that $\pi_j(F_k(x)) \neq \pi_j(F_{k'}(x))$ for
$j \in_R \{1, \ldots, n\}$. If we execute $2n^2$ such membership queries in parallel then
with probability at least $1 - e^{-n}$ we find a counterexample. Summing over all (at
most polynomially many) equivalence queries we find counterexamples for all queries
with probability at least $1 - \mathrm{poly}(n) \cdot e^{-n}$. Hence, this simulation only
fails with negligible probability and we can therefore apply the argument of the
previous theorem.
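
The failure bound for a single simulated equivalence query follows from a standard estimate. The worked step below assumes that the $2n^2$ pairs $(x, j)$ are sampled independently and uses the inequality $1 - z \leq e^{-z}$; both are assumptions of this sketch rather than statements taken from the paper.

```latex
\[
  \mathrm{Prob}[\text{no counterexample among the } 2n^2 \text{ queries}]
  \;\le\; \Bigl(1 - \frac{1}{2n}\Bigr)^{2n^2}
  \;\le\; \exp\Bigl(-\frac{2n^2}{2n}\Bigr)
  \;=\; e^{-n} .
\]
```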

Theorem 2. If one-way functions exist, then there is a concept class that is
not weakly predictable with membership and equivalence queries in query depth
$d(n)$, but can be learned exactly with membership queries in query depth $d(n) + 1$
for any fixed polynomial $d(n)$.

4 Applications to Random Self-Reductions and Coherent Sets

Feigenbaum, Fortnow et al. [?] present a set $\mathcal{L}$ in $\mathrm{DSPACE}(n^{\beta(n)})$ for any
unbounded, nondecreasing function $\beta(n)$ (with $n^{\beta(n)}$ time-constructible) such
that $\mathcal{L}$ is adaptively randomly self-reducible, while nonadaptive random
self-reductions do not exist. This result holds unconditionally. Assuming
$\mathrm{NEEE} \not\subseteq \mathrm{BPEEE}$, they show that there exist such sets in NP. This
assumption has been reduced to $\mathrm{NE} \not\subseteq \mathrm{BPE}$ by Hemaspaandra, Naik,
Ogihara and Selman [?]. Combining the idea of Feigenbaum et al. with our result for
learning algorithms we obtain the following:

Proposition 1. Let $\beta(n)$ be an unbounded, nondecreasing function such that
$n^{\beta(n)}$ is time-constructible and $n^{\beta(n)} \cdot 2^{-n}$ is negligible. Let $d(n)$ be
a fixed polynomial. If one-way functions exist, there is a language $\mathcal{L}$ in
$\mathrm{DSPACE}(n^{\beta(n)})$ such that there is no length-preserving random self-reduction
of query depth $d(n)$ for $\mathcal{L}$, though there is a deterministic, oblivious,
length-preserving random self-reduction of query depth $d(n) + 1$.

We remark that $n^{\beta(n)} \cdot 2^{-n}$ is negligible if, for instance,
$\beta(n) \cdot \log n < n/2$ for sufficiently large $n$. This is true for
$\beta(n) = \log^* n$.

Proof. The proof is similar to the proof given in [?]. We view a random
self-reduction given by algorithms $\sigma$ and $\phi$ as a single Turing machine $M$. The
choice of $\beta(n)$ ensures that $n^{\beta(n)} > p(n)$ for any polynomial $p(n)$ and all
sufficiently large $n$. We can therefore diagonalize against the length-preserving
random self-reductions $M_1, M_2, \ldots$ of query depth $d(n)$. The language $\mathcal{L}$
consists of tuples $(x,j)$ such that $\pi_j(F_k^d(x)) = 1$ for an appropriate key
$k \in \{0,1\}^n$.

$M_i$'s running time and therefore the number of queries is bounded above by
$n^{\beta(n)}$. Any query $\sigma(j,x)$ of $M_i$ is distributed independently of $x$. If we
choose a random input $(x,j) \in_R \{0,1\}^{n + \lceil \log n \rceil}$ and let $M_i$ run on that
input, then with probability at most $n^{\beta(n)} \cdot 2^{-n}$ the value $(x,j)$ appears
among the queries. By assumption, $n^{\beta(n)} \cdot 2^{-n}$ is negligible. Hence, given that
$M_i$ does not query the input, we can turn $M_i$ (that decides membership correctly
with probability at least $2/3$) into a successful distinguisher for the underlying
pseudorandom function. From this we derive that every length-preserving random
self-reduction fails with probability more than $1/3$ for all sufficiently large $n$.
We conclude that we can determine in space $n^{\beta(n)}$ a key $k$ and a tuple $(x,j)$
such that $M_i$ fails to predict $\pi_j(F_k^d(x))$ with probability more than $1/3$. Add
all $(x,j)$ with $\pi_j(F_k^d(x)) = 1$ to $\mathcal{L}$.

The fact that this language is obliviously, randomly self-reducible is
straightforward as we can determine the key $k$ in depth $d(n) + 1$. Then we can
easily decide whether the input $(x,j)$ is in $\mathcal{L}$ by computing $\pi_j(F_k^d(x))$. □

Assuming that even for polynomial-size circuit families $D$ (instead of
probabilistic polynomial-time distinguishers) the collection of pseudorandom
functions remains pseudorandom, we can extend our result to length-monotone
random-self-reductions. We remark that security against nonuniform distinguishers
is also a widely accepted assumption in cryptography.

Corollary 1. Let $\beta(n)$ be an unbounded, nondecreasing function and assume
that $n^{\beta(n)}$ is time-constructible and that $n^{\beta(n)} \cdot 2^{-n}$ is negligible. Let
$d(n)$ be a fixed polynomial. If one-way functions exist that are secure against
nonuniform adversaries, there is a language $\mathcal{L}$ in $\mathrm{DSPACE}(n^{\beta(n)})$ such that
there is no length-monotone random self-reduction of query depth $d(n)$ for $\mathcal{L}$,
though there is a deterministic, oblivious, length-preserving random
self-reduction of query depth $d(n) + 1$.

Proof. The proof is a straightforward extension of the proof of Proposition 1.
Again, if there were a length-monotone random self-reduction we could construct
a polynomial-size circuit family with distinguishing advantage that is not
negligible. To answer queries that have smaller length we give the circuit that
simulates the random self-reduction for inputs of length $n + \lceil \log n \rceil$ the
first $n - 1$ keys determined by $\mathcal{L}$ for complexity parameters $1, \ldots, n - 1$ as
nonuniform advice. □

Beigel and Feigenbaum [?] prove that every randomly self-reducible language is
weakly coherent. Analyzing their proof it is easy to see that their transformation
of a random self-reduction to a weak examiner preserves the query depth.

Corollary 2. Let $\beta(n)$ be an unbounded, nondecreasing function. Assume that
$n^{\beta(n)}$ is time-constructible and that $n^{\beta(n)} \cdot 2^{-n}$ is negligible. Let $d(n)$
be a fixed polynomial. If one-way functions exist, there is a language $\mathcal{L}$ in
$\mathrm{DSPACE}(n^{\beta(n)})$ that is incoherent for length-preserving examiners of query
depth $d(n)$, though there exists a weak, length-preserving examiner of query depth
$d(n) + 1$.

Again, this conclusion can be extended to length-monotone examiners.
Unfortunately, we do not know whether the positive result of Corollary 2 also holds
for probabilistic polynomial-time examiners instead of weak examiners. But we
achieve this using a somewhat stronger assumption, namely the existence of one-way
permutations:

Proposition 2. Let $\beta(n)$ be an unbounded, nondecreasing function and assume
that $n^{\beta(n)}$ is time-constructible and that $n^{\beta(n)} \cdot 2^{-n}$ is negligible. Let
$d(n)$ be a fixed polynomial. If one-way permutations exist, there is a language
$\mathcal{L}$ in $\mathrm{DSPACE}(n^{\beta(n)})$ that is incoherent for length-preserving examiners of
query depth $d(n)$, though there is a deterministic, length-preserving examiner of
query depth $d(n) + 1$.

Proof. Given a one-way permutation we can construct a collection of
pseudorandom functions such that $F_n(k, 1^n) \neq F_n(k', 1^n)$ for $k \neq k'$; see [?].
Similar to the proof of Claim 1 in Appendix A we conclude that
$y_k^{(0)}, \ldots, y_k^{(d(n))} \neq 1^n$ for all but a negligible fraction of the keys
$k \in \{0,1\}^n$. Hence, the impossibility result remains valid if we restrict
ourselves to such keys. Now it also suffices to show the positive result for those
keys. Assume that the examiner $E$ is given $(x,j)$ as input. If $x \neq y_k^{(i)}$ for
all $i$ then $E$ can compute $k$ in depth $d(n) + 1$ without querying for $(x,j)$ and
decide whether $(x,j) \in \mathcal{L}$ by computing $\pi_j(F_k^d(x))$ in polynomial time.
Suppose that $x = y_k^{(i)}$ for some $i$. Then the examiner cannot query for
$(y_k^{(i)}, j)$. Fortunately, there are only two possibilities, namely
$(x,j) \in \mathcal{L}$ or $(x,j) \notin \mathcal{L}$. $E$ tries both possibilities in parallel and also
asks for $F_k(1^n)$ in a single concurrent step. This is possible as $1^n$ is different
from $y_k^{(0)}, \ldots, y_k^{(d(n))}$ by assumption about $k$. Thus $E$ derives two keys $k_0$
and $k_1$ and the value $F_k(1^n)$ in query depth $d(n) + 1$. It determines the correct
key by computing $F_{k_0}(1^n)$ and $F_{k_1}(1^n)$ in polynomial time and comparing them
to the value obtained for $F_k(1^n)$. Given the key $k$ the examiner can decide whether
$(x,j) \in \mathcal{L}$. □

Assuming one-way permutations that are secure against nonuniform adversaries
we can extend the negative result to length-monotone examiners.

A Proof of Lemma 2

Assume that there exists a pwm-algorithm $L$ that weakly predicts $C$ with respect
to $\mathcal{D}$. Let $p(n)$ denote the polynomial such that $L$ predicts correctly with
probability at least $1/2 + 1/p(n)$ for infinitely many $n \in \mathbb{N}$.¹ Since $L$ predicts
$C_n$ for all target concepts $c_k$, it also predicts $C_n$ if we choose
$k \in_R \{0,1\}^n$ and thus $c_k$ at random. From $L$ we construct a successful
distinguisher $D$ for the collection of pseudorandom functions
$F = (F_n)_{n \in \mathbb{N}}$. $D$ is given oracle access to a function $F_k(\cdot)$ in $F_n$, where
$k \in_R \{0,1\}^n$ is chosen at random. Basically, $D$ simulates $L$ and for each
membership query $(x,j)$ of $L$ algorithm $D$ queries “on-line” the function oracle
for $x$ and, given the answer $z = F_k(x)$, returns $\pi_j(z)$ to $L$.

¹ Note that we only demand that $L$ predicts correctly infinitely often. Weak
predictability actually requires $L$ to predict correctly for all sufficiently
large $n$. This even strengthens our result.

We start by showing that $y_k^{(0)}, \ldots, y_k^{(d(n))}$ are distinct with high
probability. We use this fact to prove that the probability that the learning
algorithm $L$ queries $(y_k^{(d(n))}, j)$ for some $j$ is negligible. If $L$ does not
query $(y_k^{(d(n))}, j)$ for any $j$ then $D$ is able to answer all queries of $L$ using
its oracle $F_k(\cdot)$. This is possible as $c_k(x,j) = \pi_j(F_k(x))$ except for
$x = y_k^{(d(n))}$. If $L$ queries $(y_k^{(d(n))}, j)$ for some $j$ then $D$ is supposed to
return the $j$-th bit of the key $k$ to $L$, because
$c_k(y_k^{(d(n))}, j) = \pi_j(F_k^d(y_k^{(d(n))})) = \pi_j(k)$. But $D$ does not know the
secret key $k$ and cannot guess it, because this would contradict the
pseudorandomness of the underlying collection. Hence, if $L$ queries $y_k^{(d(n))}$
then the simulation fails. Fortunately, the probability of this is negligible and,
given that the simulation succeeds, it is easy to show that $L$ cannot weakly
predict the concept class.

Claim 1: The probability that $y_k^{(i)} = y_k^{(j)}$ for some $i < j$ with
$i, j \in \{0, \ldots, d(n)\}$ is negligible.

Proof. We prove that otherwise there exists a polynomial-time algorithm $D'$
that successfully predicts the value $F_k(y)$ for an appropriate $y$ with probability
at least $1/q'(n)$ for a polynomial $q'(n)$ and infinitely many $n \in \mathbb{N}$. Assume that
the probability that there exist $i, j$ as in the claim is not negligible. More
precisely, let this probability be greater than $1/q(n)$ for a polynomial $q$ and
infinitely many $n$. For a fixed key $k$ we call a pair $(i,j)$ bad if $i < j$ and
$y_k^{(i)} = y_k^{(j)}$. If a bad pair exists then there is also a minimal bad pair
$(i_0, j_0)$, i.e., such that there does not exist another bad pair $(i,j)$ with
$j < j_0$. We construct $D'$ as follows. $D'$ tries to guess $(i_0, j_0)$ by choosing
$J \in_R \{1, \ldots, d(n)\}$ and $I \in_R \{0, \ldots, J-1\}$ at random. Then $D'$ computes
$y_k^{(0)}, \ldots, y_k^{(J-1)}$ by querying the oracle $F_k(\cdot)$. If
$y_k^{(J-1)} \neq y_k^{(0)}, \ldots, y_k^{(J-2)}$ then $D'$ outputs $(y_k^{(J-1)}, y_k^{(I)})$.
Else $D'$ gives an arbitrary output. If there exists a (minimal) bad pair
$(i_0, j_0)$ then $(I, J) = (i_0, j_0)$ with probability at least $1/d^2(n)$. In this
case, $y_k^{(J-1)} \neq y_k^{(0)}, \ldots, y_k^{(J-2)}$ because $(i_0, j_0)$ is minimal.
Additionally, $F_k(y_k^{(J-1)}) = y_k^{(J)} = y_k^{(I)}$. Hence, the prediction of $D'$ is
right with probability at least $1/(d^2(n) q(n))$. This contradicts the
unpredictability of $F$. □

We show that the probability (over the random choice of the target concept and
the internal coin tosses of $L$) that $L$ queries $y_k^{(d(n))}$ is less than any
polynomial fraction.

Claim 2: The probability that $L$ queries $(y_k^{(d(n))}, j)$ for some
$j \in \{1, \ldots, n\}$ is negligible.

Proof. Suppose, towards contradiction, that $L$ asks a membership query for
$(y_k^{(d(n))}, j)$ for some $j$ with probability at least $1/q(n)$ for a polynomial $q$
and infinitely many $n \in \mathbb{N}$. Then we derive a predictor $D'$ for $F$ with
prediction probability $1/q'(n)$ for a polynomial $q'$ and infinitely many $n$. First
observe that the probability that $y_k^{(0)}, \ldots, y_k^{(d(n))}$ are pairwise different
and that $L$ queries $(y_k^{(d(n))}, j)$ is at least $1/2q(n)$ for infinitely many $n$;
though these events are not necessarily independent we can assume by Claim 1 that
$y_k^{(0)}, \ldots, y_k^{(d(n))}$ are distinct with probability at least $1 - 1/2q(n)$ for
sufficiently large $n$. The bound $1/2q(n)$ therefore follows from the fact that
$\mathrm{Prob}[A \wedge B] \geq \mathrm{Prob}[A] - \mathrm{Prob}[\neg B]$ for any events $A, B$.
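
For completeness, the inequality and its use here can be spelled out as follows. In this sketch $A$ stands for the event that $L$ queries $(y_k^{(d(n))}, j)$ for some $j$ and $B$ for the event that $y_k^{(0)}, \ldots, y_k^{(d(n))}$ are pairwise different; this naming of the events is an assumption made for illustration.

```latex
\[
  \mathrm{Prob}[A] = \mathrm{Prob}[A \wedge B] + \mathrm{Prob}[A \wedge \neg B]
  \;\le\; \mathrm{Prob}[A \wedge B] + \mathrm{Prob}[\neg B],
\]
\[
  \text{hence}\quad
  \mathrm{Prob}[A \wedge B] \;\ge\; \frac{1}{q(n)} - \frac{1}{2q(n)} \;=\; \frac{1}{2q(n)}
  \quad \text{for infinitely many } n .
\]
```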

Recall that the query depth of $L$ is $d(n)$. Thus, given that $L$ queries
$y_k^{(d(n))}$ and that $y_k^{(0)}, \ldots, y_k^{(d(n))}$ are pairwise different, there exist
$i, r \in \{1, \ldots, d(n)\}$ such that $L$ queries $y_k^{(i)}$ in round $r$ without having
queried $y_k^{(i-1)}$ in the preceding $r - 1$ rounds. Since $D'$ does not necessarily
know $i$ and $r$ it tries to guess these values by picking
$I, R \in_R \{1, \ldots, d(n)\}$ uniformly at random. $D'$ computes
$y_k^{(0)}, \ldots, y_k^{(I-1)}$ via the function oracle and then simulates $L$ until $L$
has output the membership queries for round $R$. Let $p_L(n)$ denote the polynomial
that bounds the running time of $L$ and thus the number of queries in each round.
$D'$ uniformly picks a query $(y, j)$ of the at most $p_L(n)$ queries. The value $y$
will be the guess for $y_k^{(I)} = F_k(y_k^{(I-1)})$. If $y_k^{(I-1)}$ has not been among
$L$'s queries in the previous rounds, $D'$ outputs the pair $(y_k^{(I-1)}, y)$. With
probability at least $1/d^2(n) p_L(n)$, more specifically, if $I = i$ and $R = r$ and
$y = y_k^{(I)}$, the value $y_k^{(I-1)}$ has not been queried previously. If
$y_k^{(I-1)}$ has already been queried, $D'$ outputs an arbitrary pair. Assume that
$y_k^{(I-1)}$ has not appeared among the queries. Then $D'$ predicts
$F_k(y_k^{(I-1)})$ correctly with probability at least $1/(2 d^2(n) p_L(n) q(n))$ for
infinitely many $n$, which is not negligible. The claim follows. □

We conclude that with probability at least $1 - 1/2p(n)$ (for large $n$) algorithm
$L$ does not query $(y_k^{(d(n))}, j)$ for any $j$. In this case, $D$ is able to answer
all queries correctly. After $L$ has stopped and asked for a challenge, $D$
generates a random $x \in_R \{0,1\}^n$ and $j \in_R \{1, \ldots, n\}$. With probability
$1 - (p_L(n) + 1) \cdot 2^{-n}$ we have $x \neq y_k^{(d(n))}$ and $x$ has not been queried by
$L$ previously. We call such $x$ fresh.

Let $x$ be $D$'s challenge, i.e., $D$ is given $F_k(x)$ and $r \in_R \{0,1\}^n$ in random
order $(Q_0, Q_1)$. Let $t$ denote $L$'s prediction for $c_k(x,j)$. $D$ outputs a guess
$g \in \{0,1\}$ as follows:

— if $\pi_j(Q_0) = \pi_j(Q_1)$ then $g$ is chosen at random

— if $\pi_j(Q_0) \neq \pi_j(Q_1)$ then define $g$ such that $\pi_j(Q_g) = t$
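
The guessing rule above is short enough to write out. The following Python sketch models $Q_0$ and $Q_1$ as 8-bit integers and takes bit $j = 1$ to be the leftmost bit; the block length and the names `bit` and `guess_b` are assumptions made for illustration.

```python
import secrets

N = 8  # toy block length in bits (assumption)


def bit(value: int, j: int) -> int:
    """Projection pi_j of an N-bit integer, j = 1 being the leftmost bit."""
    return (value >> (N - j)) & 1


def guess_b(q0: int, q1: int, j: int, t: int, coin=secrets.randbelow) -> int:
    """D's guess g for b, given (Q_0, Q_1), the index j and L's prediction t."""
    if bit(q0, j) == bit(q1, j):
        return coin(2)  # bit j carries no information: guess at random
    return 0 if bit(q0, j) == t else 1  # pick g with pi_j(Q_g) = t
```

The `coin` parameter only exists so that the random branch can be replaced by a fixed value when experimenting.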

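We remark that each case occurs with probability $1/2$ since $\pi_j(r)$ is a random
bit. In the former case, $D$ is successful with probability $1/2$. In the latter case,
$D$'s guess is correct if and only if $t$ is. Note that we require that $t$ is correct
and that $L$ does not query $y_k^{(d(n))}$. Again, using the fact
$\mathrm{Prob}[A \wedge B] \geq \mathrm{Prob}[A] - \mathrm{Prob}[\neg B]$, this happens with probability
at least $1/2 + 1/2p(n)$ for infinitely many $n$.

It remains to analyze $D$'s success probability. Let $\mathrm{not}(y_k^{(d(n))})$ denote the
event that $L$ does not query $y_k^{(d(n))}$, $\mathrm{fresh}(x)$ that $x$ is fresh, and
$\mathrm{correct}(t)$ that $L$'s prediction is right. Furthermore, let $\mathrm{case}_1$ and
$\mathrm{case}_2$ denote the events that the first case ($\pi_j(Q_0) = \pi_j(Q_1)$) resp. the
other case ($\pi_j(Q_0) \neq \pi_j(Q_1)$) occurs. Then

\begin{align*}
\mathrm{Prob}[b = g]
&\geq \mathrm{Prob}\bigl[b = g \wedge \mathrm{not}(y_k^{(d(n))}) \wedge \mathrm{fresh}(x)\bigr] \\
&= \mathrm{Prob}\bigl[b = g \wedge \mathrm{not}(y_k^{(d(n))}) \wedge \mathrm{fresh}(x) \wedge \mathrm{case}_1\bigr] \\
&\quad + \mathrm{Prob}\bigl[b = g \wedge \mathrm{not}(y_k^{(d(n))}) \wedge \mathrm{fresh}(x) \wedge \mathrm{case}_2\bigr] \\
&\geq \mathrm{Prob}\bigl[b = g \bigm| \mathrm{not}(y_k^{(d(n))}) \wedge \mathrm{fresh}(x) \wedge \mathrm{case}_1\bigr]
      \cdot \mathrm{Prob}\bigl[\mathrm{not}(y_k^{(d(n))}) \wedge \mathrm{fresh}(x) \wedge \mathrm{case}_1\bigr] \\
&\quad + \mathrm{Prob}\bigl[b = g \bigm| \mathrm{not}(y_k^{(d(n))}) \wedge \mathrm{correct}(t) \wedge \mathrm{fresh}(x) \wedge \mathrm{case}_2\bigr] \\
&\qquad \cdot \mathrm{Prob}\bigl[\mathrm{not}(y_k^{(d(n))}) \wedge \mathrm{correct}(t) \wedge \mathrm{fresh}(x) \wedge \mathrm{case}_2\bigr] \\
&\geq \frac{1}{2} \cdot \Bigl(1 - \frac{1}{2p(n)}\Bigr) \cdot \Bigl(1 - \frac{p_L(n) + 1}{2^n}\Bigr) \cdot \frac{1}{2}
      + 1 \cdot \Bigl(\frac{1}{2} + \frac{1}{2p(n)}\Bigr) \cdot \Bigl(1 - \frac{p_L(n) + 1}{2^n}\Bigr) \cdot \frac{1}{2} \\
&\geq \frac{1}{4} - \frac{1}{8p(n)} - \frac{p_L(n) + 1}{4 \cdot 2^n}
      + \frac{1}{4} + \frac{1}{4p(n)} - \frac{p_L(n) + 1}{4 \cdot 2^n} - \frac{p_L(n) + 1}{4p(n) \cdot 2^n} \\
&\geq \frac{1}{2} + \frac{1}{16p(n)}
\end{align*}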
for infinitely many $n \in \mathbb{N}$. That is, we obtain a distinguisher with
distinguishing advantage that is not negligible. This contradicts the
pseudorandomness of $F$.

Abstract. We investigate the learning problem of unary output two-tape
non deterministic finite automata (unary output 2-tape NFAs) from
multiplicity and equivalence queries. Given an alphabet $A$ and a unary
alphabet $\{x\}$, a unary output 2-tape NFA accepts a subset of $A^* \times \{x\}^*$.
In [?], Bergadano and Varricchio proved that the behavior of an unknown
automaton with multiplicity in a field $K$ ($K$-automaton) is exactly
identifiable when multiplicity and equivalence queries are allowed. In this
paper multiplicity automata are used to prove the learnability of unary
output 2-tape NFAs. We shall identify the behavior of a unary output 2-tape
NFA using an automaton with multiplicity in $K^{\mathrm{rat}}\langle\langle x \rangle\rangle$. We provide
an algorithm which is polynomial in the size of this automaton.



1 Introduction 

The exact learning model was introduced by Angluin [?]. In this model we consider
a learner that does not just passively receive data, but is able to ask queries.
Some queries, called membership queries, consist in asking an oracle whether
a particular string belongs to the target language. Another possibility is found in
equivalence queries, asking an oracle whether a guess is correct, and obtaining a
counterexample if it is not. In particular, the following classes were shown to be
learnable in this model: deterministic automata [5] and various types of DNF
formulas. Learnability in this model also implies learnability in the “PAC” model
with membership queries [?]. The notion of a multiplicity query was introduced
by Bergadano and Varricchio [?], who proved that the behavior of an unknown
automaton with multiplicity in a field $K$ ($K$-automaton) is exactly identifiable
when multiplicity and equivalence queries are allowed. As a consequence,
$K$-automata are PAC learnable from multiplicity queries under any distribution.

In this paper we consider a nontrivial extension of classical automata, that
is two-tape non deterministic automata. In particular, we shall consider unary
output two-tape non deterministic automata (unary-output 2-tape NFAs), and
investigate the learning problem for this class. A two-tape non deterministic
automaton (2-tape NFA) is a “finite-state machine” that scans two tapes containing
words over two disjoint alphabets $A$ and $A_2$. A 2-tape NFA accepts a subset of
$A^* \times A_2^*$. A two-tape automaton can also be regarded as a transducer (cf. [2]):
the first tape is the input tape and the second is the output tape. A unary-output
2-tape NFA is a 2-tape NFA which allows a unary alphabet $A_2 = \{x\}$ on the
second tape, so it accepts a subset of $A^* \times \{x\}^*$.

More generally, the notion of multi-tape finite automaton was introduced
by Rabin and Scott in 1959 [?]. They showed that, unlike for ordinary finite
automata, non deterministic multi-tape automata are more powerful than the
deterministic ones. This holds already in the case of two tapes. As a central model
of automata, multi-tape automata have gained plenty of attention. However, many
important problems have remained open for a long time. For non deterministic
automata (even for two-tape) the equivalence problem is undecidable (see [?]);
Ibarra has proved that the equivalence problem for unary-output 2-tape NFAs is
also undecidable [?]. Conversely, the equivalence problem of multi-tape
deterministic automata has been expected to be decidable. Harju and Karhumäki [?]
showed that for non deterministic multi-tape automata the multiplicity equivalence
problem is decidable, that is, we can decide if two automata accept the same
$n$-tuples of words exactly the same number of times. In contrast to Ibarra's
result, the multiplicity equivalence problem for unary-output 2-tape NFAs is then
decidable.

Based on previous work by Bergadano and Varricchio [?], in this paper
we identify the behavior of a unary output 2-tape NFA with the behavior of
an automaton with multiplicity in the set of rational series over a one-letter
alphabet $\{x\}$ ($K^{\mathrm{rat}}\langle\langle x \rangle\rangle$) and we provide an algorithm that is
polynomial in the size of the automaton.

We remark that in m Yokomori has given a polynomial time algorithm that 
identifies any deterministic two-tape automaton from membership and equiva- 
lence queries. However, we consider non deterministic automata and use multi- 
plicity queries instead of membership queries. We recall that a multiplicity query 
asks the number of accepting paths for a given pair of strings. 

2 Rational series and Multiplicity automata 

Let $K$ be a field and $A^*$ be the free monoid over the finite alphabet $A$. We consider the set $K\langle\langle A\rangle\rangle$ of all the applications $S : A^* \to K$. An element $S$ of $K\langle\langle A\rangle\rangle$ is called a formal series with (non-commuting) variables in $A$ or a $K$-set of $A^*$. For any $S \in K\langle\langle A\rangle\rangle$ and $u \in A^*$ we will denote $S(u)$ by $(S, u)$.

Definition 1. Let $E \subseteq A^*$. For any $S, T \in K\langle\langle A\rangle\rangle$, we say $S =_E T$ if and only if $(S, w) = (T, w)$ for any $w \in E$. $S = T$ stands for $S =_{A^*} T$.

A semiring structure is defined on $K\langle\langle A\rangle\rangle$. In fact, if $S$ and $T$ are two $K$-sets of $A^*$, then we can define the following operations, called rational operations: for each $w \in A^*$ and $\alpha \in K$

— the sum $S + T$ is given by $(S + T, w) = (S, w) + (T, w)$;

— the (Cauchy) product $ST$ is given by $(ST, w) = \sum_{uv = w} (S, u)(T, v)$;

— the external operation of $K$ on $K\langle\langle A\rangle\rangle$ is defined as $(\alpha S, w) = \alpha (S, w)$;

— the star operation of $S$, denoted by $S^*$, is the sum $S^* = \sum_{n \ge 0} S^n$, if $S$ is proper, that is $(S, \varepsilon) = 0$.

Definition 2. Let $K^{n \times n}$ be the monoid of the $n \times n$ square matrices equipped with the row by column product. A map $\mu : A^* \to K^{n \times n}$ is called a morphism if $\mu(\varepsilon) = Id$, where $Id$ is the identity matrix, and $\mu(w) = \mu(a_1) \cdots \mu(a_n)$, for any $w = a_1 \ldots a_n \in A^*$.



Definition 3. A $K$-set $S$ is called recognizable or rational if there exist a positive integer $n$, a row-vector $\lambda \in K^{1 \times n}$, a column-vector $\gamma \in K^{n \times 1}$ and a morphism $\mu : A^* \to K^{n \times n}$ such that for any $w \in A^*$, $(S, w) = \lambda \mu(w) \gamma$. The triplet $(\lambda, \mu, \gamma)$ is called a linear representation of $S$ of dimension $n$. We denote the family of these series with $K^{rat}\langle\langle A\rangle\rangle$.

Definition 4. For any string $u \in A^*$, and a $K$-set $S$ of $A^*$, the formal series $S_u$ and ${}_uS$ are defined by:

$$(S_u, w) = (S, uw), \qquad ({}_uS, w) = (S, wu), \qquad \forall w \in A^*. \qquad (1)$$



Definition 5. Let $S$ be a formal series of $K\langle\langle A\rangle\rangle$. The Hankel matrix $H(S)$ of $S$ is the infinite matrix whose rows and columns are indexed by the words of $A^*$, where the element with indexes $u$ and $v$ is equal to $(S, uv)$.

It is known that a $K$-set $S$ is recognizable if and only if the rank of $S$ (defined below) is finite. Furthermore, if $rank(S) = r$ is finite, then there exists a linear representation of $S$ of dimension $r$; conversely, if there exists a linear representation of $S$ of dimension $h$, then $rank(S) \le h$.

We remark that $K^{A^*} = K\langle\langle A\rangle\rangle$ is a vector space over $K$. Let $S$ be a $K$-set; the dimension of the subspace of $K^{A^*}$ generated by the columns of $H(S)$ is called the rank of $S$ and denoted by $rank(S)$. We recall that $rank(S)$ is also equal to the dimension of the subspace of $K^{A^*}$ generated by the rows of $H(S)$. Let $u \in A^*$; then the $u$-th row and the $u$-th column of $H(S)$ are the formal series $S_u$ and ${}_uS$, respectively.



We recall now some definitions and notations on multiplicity automata. More details are in 11011611 J . Let $K$ be a field. An automaton with multiplicity in $K$, also called multiplicity automaton, is a 5-tuple $M = (Q, A, E, I, F)$, where $A$ is a finite alphabet, $Q$ is a finite set of states, $I, F : Q \to K$ are two mappings and $E : Q \times A \times Q \to K$ is a map that associates a multiplicity with each edge of $M$. The maps $I$ and $F$ represent for any state $q \in Q$ the multiplicity of $q$ as initial state and final state, respectively. We will sometimes call $M$ a $K$-automaton for brevity. Let $w = a_1 \ldots a_n \in A^*$. A path for $w$ is a sequence

$$\pi = (p_1, a_1, p_2), (p_2, a_2, p_3), \ldots, (p_n, a_n, p_{n+1}),$$

where $p_i \in Q$ (for $i = 1, \ldots, n+1$). We denote the set of these paths for $w$ as $\Pi(w)$. The multiplicity $Mult(\pi)$ of the path $\pi$ is the product of the multiplicities of the edges of the path, i.e. $Mult(\pi) = \prod_{i=1}^{n} E(p_i, a_i, p_{i+1})$.

Definition 6. The behavior of $M$ is a mapping $S_M : A^* \to K$ defined as follows:

$$S_M(w) = \sum_{\pi \in \Pi(w)} I(p_1)\, Mult(\pi)\, F(p_{n+1}) \qquad \forall w = a_1 \ldots a_n \in A^*.$$

One can associate a linear representation $(\lambda, \mu, \gamma)$ of dimension $n$ with a $K$-automaton $M = (\{q_1, q_2, \ldots, q_n\}, A, E, I, F)$ in the following way: for each $i, j = 1, \ldots, n$ and $a$ in $A$, $\lambda_i = I(q_i)$, $\mu(a)_{i,j} = E(q_i, a, q_j)$ and $\gamma_i = F(q_i)$. One can easily prove that, for any $w \in A^*$, $\mu(w)_{i,j}$ is the sum of the multiplicities of the paths labelled by $w$ from $q_i$ to $q_j$ and the behavior $S_M$ of $M$ is the recognizable $K$-set defined as $(S_M, w) = \lambda \mu(w) \gamma$, $\forall w \in A^*$.

Any nondeterministic finite automaton $M$ can be represented as a $\mathbb{Q}$-automaton, since the initial states, the final states and the edges of the automaton can be represented by their characteristic functions. In this case, for any $w \in A^*$, $\mu(w)_{i,j}$ is the number of paths labelled with $w$ from $q_i$ to $q_j$ and $(S_M, w)$ is the number of different paths which are accepting for $w$.
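As a small illustration of this correspondence, the following sketch counts the accepting paths of an NFA as $\lambda \mu(w) \gamma$. The two-state automaton and the helper name `behavior` are assumptions made for this example only; they do not come from the paper.

```python
def behavior(lam, mu, gamma, word):
    """Return (S_M, w) = lam * mu(w) * gamma for a linear representation.

    lam   -- row vector of initial multiplicities
    mu    -- dict mapping each letter to its n x n transition matrix
    gamma -- column vector of final multiplicities
    """
    vec = list(lam)
    for letter in word:
        m = mu[letter]
        # Multiply the current row vector by mu(letter).
        vec = [sum(vec[i] * m[i][j] for i in range(len(vec)))
               for j in range(len(m[0]))]
    return sum(v * g for v, g in zip(vec, gamma))


# Hypothetical NFA: state 0 loops on a and b and moves to state 1 on a;
# state 1 loops on a and b. State 0 is initial, state 1 is final.
LAM = [1, 0]
GAMMA = [0, 1]
MU = {"a": [[1, 1], [0, 1]], "b": [[1, 0], [0, 1]]}

if __name__ == "__main__":
    for w in ["", "b", "ab", "abab", "aaa"]:
        print(repr(w), behavior(LAM, MU, GAMMA, w))
```

In this example every accepting path chooses one occurrence of the letter a at which to leave state 0, so the multiplicity of a word equals its number of a's.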

In general a linear representation $(\lambda, \mu, \gamma)$ of dimension $n$ of a recognizable $K$-set can be regarded as an “automaton” whose set of states is $Q = \{q_1, q_2, \ldots, q_n\}$; the initial and the final states are defined as $K$-sets of $Q$, while the edges are a $K$-set of $Q \times A \times Q$. Indeed $\lambda_i$ (resp. $\gamma_i$) represents the multiplicity of $q_i$ as an initial state (resp. final state) and $\mu(a)_{i,j}$ the multiplicity of the edge $(q_i, a, q_j)$.

3 Rational series in one variable 

The set of rational series over a one-letter alphabet $\{x\}$ is denoted by $K^{rat}\langle\langle x\rangle\rangle$. We may identify a series $S$ in $K^{rat}\langle\langle x\rangle\rangle$ with a sequence $(a_n)_{n \ge 0}$ of elements of $K$ where $a_n = (S, x^n)$, and we denote $S$ by $\sum_{n \ge 0} a_n x^n$. We remark that the Hankel matrix $H(S)$ of $S$ satisfies the following properties:

— $H(S)$ is the matrix $(a_{i+j})_{i,j \ge 0}$.

— The $x^i$-th row coincides with the $x^i$-th column, i.e. $S_{x^i} = {}_{x^i}S$ for all $i \ge 0$; thus the subspace of $K^{\{x\}^*}$ generated by the columns of $H(S)$ coincides with the subspace of $K^{\{x\}^*}$ generated by the rows of $H(S)$.

— The rank of $H(S)$ is the minimal dimension of a linear representation of $S$.
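The first property can be made concrete with a short sketch that builds a finite block of the Hankel matrix from a sequence and computes its rank over the rationals. The function names and the Fibonacci example are assumptions chosen for illustration.

```python
from fractions import Fraction


def hankel_block(seq, h, k):
    """Finite h x k block of H(S): entry (i, j) is a_{i+j} = (S, x^{i+j})."""
    return [[Fraction(seq(i + j)) for j in range(k)] for i in range(h)]


def rank(matrix):
    """Rank of a matrix by Gaussian elimination over the rationals."""
    rows = [row[:] for row in matrix]
    r = 0
    for c in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


def fib(n):
    """Fibonacci numbers 0, 1, 1, 2, 3, ... used as a sample sequence a_n."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


if __name__ == "__main__":
    print(rank(hankel_block(fib, 2, 2)), rank(hankel_block(fib, 4, 4)))
```

For the Fibonacci sequence both blocks have rank 2, in line with the fact that the sequence has a linear representation of dimension 2.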

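Theorem 1. Let $S = \sum_{n \ge 0} a_n x^n$ be a series in $K^{rat}\langle\langle x\rangle\rangle$. If there are coefficients $c_0, \ldots, c_{d-1} \in K$ such that

$$S_{x^d} = \sum_{j=0}^{d-1} c_j S_{x^j}, \qquad (2)$$

then for any $t \ge 0$ and $i = 0, 1, \ldots, d-1$

$$S_{x^{i+t}} = \sum_{j=0}^{d-1} \mu(x^t)_{i,j} S_{x^j}, \qquad (3)$$

where $\mu : \{x\}^* \to K^{d \times d}$ is the morphism defined as follows:

$\mu(x)_{i,j} = 1$ if $j = i+1$, for $i = 0, \ldots, d-2$
$\mu(x)_{i,j} = 0$ if $j \ne i+1$, for $i = 0, \ldots, d-2$
$\mu(x)_{i,j} = c_j$ if $i = d-1$

Proof. By induction on $t$. Since $\mu(\varepsilon)$ is the identity matrix, if $t = 0$, then for $i = 0, \ldots, d-1$, $S_{x^i} = \sum_{j=0}^{d-1} \mu(\varepsilon)_{i,j} S_{x^j}$. If $t = 1$, then from the definition of $\mu$ and Eq. (2) one easily derives that $S_{x^{i+1}} = \sum_{j=0}^{d-1} \mu(x)_{i,j} S_{x^j}$ for $i = 0, 1, \ldots, d-1$.

We assume, by induction, that for any $k \le t$, $S_{x^{i+k}} = \sum_{j=0}^{d-1} \mu(x^k)_{i,j} S_{x^j}$ for $i = 0, 1, \ldots, d-1$. Thus

$$S_{x^{i+t+1}} = (S_{x^{i+t}})_x = \sum_{j=0}^{d-1} \mu(x^t)_{i,j} (S_{x^j})_x = \sum_{j=0}^{d-1} \mu(x^t)_{i,j} S_{x^{j+1}} = \sum_{j=0}^{d-1} \mu(x^t)_{i,j} \sum_{k=0}^{d-1} \mu(x)_{j,k} S_{x^k} = \sum_{k=0}^{d-1} \Big( \sum_{j=0}^{d-1} \mu(x^t)_{i,j} \mu(x)_{j,k} \Big) S_{x^k} = \sum_{k=0}^{d-1} \mu(x^{t+1})_{i,k} S_{x^k}. \qquad \Box$$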
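The morphism of Theorem 1 is a companion matrix. The following sketch builds it and checks Eq. (3) coefficientwise, that is, $a_{i+t+n} = \sum_j \mu(x^t)_{i,j} a_{j+n}$, on a finite prefix of a sequence. The helper names are hypothetical and the Fibonacci prefix is only a sample input.

```python
def companion(c):
    """Matrix mu(x) of Theorem 1 for coefficients c = [c_0, ..., c_{d-1}]."""
    d = len(c)
    mu = [[0] * d for _ in range(d)]
    for i in range(d - 1):
        mu[i][i + 1] = 1          # rows 0..d-2 shift the index by one
    mu[d - 1] = list(c)           # last row carries the coefficients c_j
    return mu


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_pow(m, t):
    result = [[int(i == j) for j in range(len(m))] for i in range(len(m))]
    for _ in range(t):
        result = mat_mul(result, m)
    return result


def check_eq3(a, c, max_t):
    """Check a_{i+t+n} == sum_j mu(x^t)_{ij} a_{j+n} on the given prefix a."""
    d = len(c)
    mu = companion(c)
    for t in range(max_t + 1):
        mu_t = mat_pow(mu, t)
        for i in range(d):
            for n in range(len(a) - (d - 1) - t):
                if a[i + t + n] != sum(mu_t[i][j] * a[j + n] for j in range(d)):
                    return False
    return True


FIBS = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

if __name__ == "__main__":
    print(companion([1, 1]), check_eq3(FIBS, [1, 1], 6))
```

With $c_0 = c_1 = 1$ the hypothesis (2) is the Fibonacci recurrence, so the check succeeds; with coefficients that do not satisfy (2) it fails already at $t = 1$.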



Definition 7. Let $S$ be a series in $K^{rat}\langle\langle x\rangle\rangle$ and $H(S)$ be its Hankel matrix. For each $h \ge 1$ let $E_h = \{\varepsilon, x, \ldots, x^{h-1}\}$. We denote by $H(S)|_h$ the infinite submatrix of $H(S)$ whose rows are indexed by the words of $E_h$, and by $H(S)|_{h \times k}$ the finite submatrix of $H(S)$ whose rows are indexed by the words of $E_h$ and whose columns are indexed by the words of $E_k$.

Theorem 2. If $S = \sum_{n \ge 0} a_n x^n \in K^{rat}\langle\langle x\rangle\rangle$ is a series with rank $r$, then there are coefficients $q_0, \ldots, q_{r-1}$ in $K$ such that

$$S_{x^r} = \sum_{i=0}^{r-1} q_i S_{x^i}. \qquad (4)$$

Proof. Since $S \in K^{rat}\langle\langle x\rangle\rangle$ has rank $r$, the $r + 1$ rows of $H(S)$ with indexes in $\varepsilon, x, \ldots, x^r$, that is the series $S, S_x, \ldots, S_{x^{r-1}}, S_{x^r}$, are linearly dependent. Hence, there are elements $c_i$ in $K$, not all zero, such that $\sum_{i=0}^{r} c_i S_{x^i} = 0$. One has $c_r \ne 0$. If, by contradiction, we assume $c_r = 0$, then let $k$ be the greatest index such that $c_k \ne 0$ and $c_j = 0$ for each $k < j \le r$. Then $S_{x^k} = \sum_{i=0}^{k-1} c'_i S_{x^i}$, where $c'_i = -(c_k)^{-1} c_i$. By Theorem 1, since the $x^k$-th row depends linearly on the rows with index in $E_k$, all the rows depend linearly on the rows with index in $E_k$. Thus we may conclude that the rank of $H(S)$ must be at most $k < r$. This is a contradiction, since we know that the rank of $H(S)$ is $r$. Thus $c_r \ne 0$ and we may write $S_{x^r} = \sum_{i=0}^{r-1} q_i S_{x^i}$, where $q_i = -(c_r)^{-1} c_i$ for $i = 0, \ldots, r-1$. □

Corollary 1. Let $S = \sum_{n \ge 0} a_n x^n$ be a recognizable series of rank $r$ and $\mu$ be the morphism as in Theorem 1, taking into account the coefficients $q_j$ of Eq. (4). One has:

$$S_{x^{i+t}} = \sum_{j=0}^{r-1} \mu(x^t)_{i,j} S_{x^j}, \qquad \forall t \ge 0 \text{ and } i = 0, 1, \ldots, r-1. \qquad (5)$$

Proof. By Theorem 2 there are coefficients $q_0, \ldots, q_{r-1} \in K$ such that $S_{x^r} = \sum_{i=0}^{r-1} q_i S_{x^i}$. By Theorem 1 the morphism $\mu$ must satisfy Eq. (5). □

Corollary 2. If $S = \sum_{n \ge 0} a_n x^n \in K^{rat}\langle\langle x\rangle\rangle$ is a series with rank $r$, then for each $h, k \ge r$ all finite matrices $H(S)|_{h \times k}$ of $H(S)$ have rank equal to $rank(S)$.

Proof. By Corollary 1 all the rows of $H(S)$ depend linearly on the rows with indexes in $E_r$. Hence for any $h \ge r$ the infinite matrix $H(S)|_h$ has rank equal to $rank(S)$. Since for any $n \ge 0$ $S_{x^n} = {}_{x^n}S$, Corollary 1 implies that all the columns of $H(S)|_h$ also depend linearly on the columns with indexes in $E_r$; thus for any $k \ge r$ the finite matrix $H(S)|_{h \times k}$ has rank equal to $rank(S)$. □

We remark that if a series in one variable $S$ has rank $r$ then the finite matrix $H(S)|_{r \times r}$ also has rank $r$.

Corollary 3. If $S = \sum_{n \ge 0} a_n x^n \in K^{rat}\langle\langle x\rangle\rangle$ is a series with rank $r$, then the system of $r$ linear equations in the $r$ unknowns $q_i$

$$S_{x^r} =_{E_r} \sum_{i=0}^{r-1} q_i S_{x^i} \qquad (6)$$

is compatible and has a single solution.

Proof. We can write the system of linear equations for $n = 0, \ldots, r-1$ as $(S_{x^r}, x^n) = \sum_{i=0}^{r-1} q_i (S_{x^i}, x^n)$; that is, $\sum_{i=0}^{r-1} a_{i+n} q_i = a_{r+n}$, for $n = 0, \ldots, r-1$.

The $r \times r$ matrix of the coefficients of the unknowns $q_i$ coincides with the finite matrix $H(S)|_{r \times r}$, whereas the complete matrix of the coefficients of the system of linear equations coincides with the finite matrix $H(S)|_{r \times (r+1)}$. By Corollary 2 both matrices have rank $r$. This implies that the system of $r$ linear equations in the $r$ unknowns $q_i$ has a single solution. □

Theorem 3. If $S = \sum_{n \ge 0} a_n x^n \in K^{rat}\langle\langle x\rangle\rangle$ is a series with rank $r$ then

$$S_{x^r} = \sum_{i=0}^{r-1} q_i S_{x^i} \iff S_{x^r} =_{E_r} \sum_{i=0}^{r-1} q_i S_{x^i}. \qquad (7)$$

Proof. The right implication is trivial. Conversely suppose that

$$S_{x^r} =_{E_r} \sum_{i=0}^{r-1} q_i S_{x^i}. \qquad (8)$$

Since $S$ has rank $r$, by Theorem 2 there are coefficients $q'_0, \ldots, q'_{r-1}$ in $K$ such that $S_{x^r} = \sum_{i=0}^{r-1} q'_i S_{x^i}$ and hence

$$S_{x^r} =_{E_r} \sum_{i=0}^{r-1} q'_i S_{x^i}. \qquad (9)$$

By Corollary 3 the systems of linear equations (8) and (9) have a single solution; thus $q_i = q'_i$ and $S_{x^r} = \sum_{i=0}^{r-1} q_i S_{x^i}$. □

Let $S = \sum_{n \ge 0} a_n x^n \in K\langle\langle x\rangle\rangle$. The following statements are equivalent (cf. [inilfip):

1. $S$ is recognizable (rational).

2. The sequence $(a_n)_{n \ge 0}$ satisfies a linear recurrence relation, i.e. there exist a positive integer $m$ and coefficients $c_0, \ldots, c_{m-1}$ such that for all $n \ge 0$, $a_{n+m} = \sum_{j=0}^{m-1} c_j a_{n+j}$.

3. $S$ has a generating function $p(x)/(1 - q(x))$, i.e. $S(1 - q(x)) = p(x)$, where $p, q$ are polynomials, and $q(\varepsilon) = 0$.

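Let $S = \sum_{n \ge 0} a_n x^n$ be a series of rank $r$. Let $q_0, \ldots, q_{r-1}$ be $r$ coefficients satisfying Eq. (6). By Theorem 3 they also satisfy Eq. (4). Then we define a linear representation $(\lambda, \mu, \gamma)$ of dimension $r$. The morphism $\mu : \{x\}^* \to K^{r \times r}$ is defined as in Theorem 1, taking into account the coefficients $q_j$ of Eq. (4). Moreover we set $\lambda = (1, 0, \ldots, 0) \in K^{1 \times r}$ and $\gamma = (a_0, \ldots, a_{r-1})^T \in K^{r \times 1}$. One has:

Theorem 4. The triplet $(\lambda, \mu, \gamma)$ is a linear representation of $S$.

Proof. By Corollary 1 one has $S_{x^{i+t}} = \sum_{j=0}^{r-1} \mu(x^t)_{i,j} S_{x^j}$ for any $t \ge 0$ and for $i = 0, \ldots, r-1$. Therefore

$$(S, x^t) = (S_{x^{0+t}}, \varepsilon) = \sum_{j=0}^{r-1} \mu(x^t)_{0,j} (S_{x^j}, \varepsilon) = \sum_{j=0}^{r-1} \mu(x^t)_{0,j}\, a_j = \lambda \mu(x^t) \gamma, \qquad \forall t \ge 0. \qquad \Box$$

We remark that $(\lambda, \mu, \gamma)$ is a linear representation of $S$ of minimal dimension.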

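Theorem 5. Let $S = \sum_{n \ge 0} a_n x^n \in K^{rat}\langle\langle x\rangle\rangle$ be a series of rank $r$ and let $q_0, \ldots, q_{r-1}$ be the coefficients satisfying Eq. (4). Then the sequence $(a_n)_{n \ge 0}$ satisfies the following linear recurrence relation:

$$\forall n \ge 0, \qquad a_{r+n} = \sum_{j=0}^{r-1} q_j a_{n+j}. \qquad (10)$$

Proof. Eq. (4) implies that for any $n \ge 0$,

$$S_{x^{r+n}} = \sum_{j=0}^{r-1} q_j S_{x^{j+n}}, \qquad (S_{x^{r+n}}, \varepsilon) = \sum_{j=0}^{r-1} q_j (S_{x^{j+n}}, \varepsilon), \qquad a_{r+n} = \sum_{j=0}^{r-1} q_j a_{n+j}. \qquad \Box$$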



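Theorem 6. Let $S = \sum_{n \ge 0} a_n x^n \in K^{rat}\langle\langle x\rangle\rangle$ be a series of rank $r$ and let $q_0, \ldots, q_{r-1}$ be the coefficients satisfying Eq. (4). Then $p/(1 - q)$ is a generating function of $S$, setting (we assume $a_k = 0$, for $k < 0$)

$$q = q_{r-1} x + q_{r-2} x^2 + \cdots + q_0 x^r \quad \text{and} \quad p = \sum_{n=0}^{r-1} \Big( a_n - \sum_{j=0}^{r-1} q_j a_{n-r+j} \Big) x^n.$$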






G. Melideo and S. Varricchio 



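Proof. By Theorem 5, for any $m \ge 0$, $a_{r+m} = \sum_{j=0}^{r-1} q_j a_{m+j}$, which implies, for any $n \ge r$, $a_n = \sum_{j=0}^{r-1} q_j a_{n-r+j}$. Hence, the polynomial $p$ can be rewritten as a formal series

$$p = \sum_{n \ge 0} \Big( a_n - \sum_{j=0}^{r-1} q_j a_{n-r+j} \Big) x^n.$$

Therefore, one has

$$(1 - q) S = \sum_{n \ge 0} a_n x^n - \Big( \sum_{j=0}^{r-1} q_j x^{r-j} \Big) \Big( \sum_{n \ge 0} a_n x^n \Big) = \sum_{n \ge 0} a_n x^n - \sum_{n \ge 0} \Big( \sum_{j=0}^{r-1} q_j a_{n-r+j} \Big) x^n = \sum_{n \ge 0} \Big( a_n - \sum_{j=0}^{r-1} q_j a_{n-r+j} \Big) x^n = p. \qquad \Box$$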
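The formulas of Theorem 6 can be checked numerically. The sketch below computes the coefficient lists of $p$ and $q$ from $q_0, \ldots, q_{r-1}$ and $a_0, \ldots, a_{r-1}$, then expands $p/(1-q)$ as a power series. The function names are hypothetical, and the Fibonacci data ($r = 2$, $q_0 = q_1 = 1$) is just a sample input.

```python
from fractions import Fraction


def generating_function(q, a):
    """Return (p, qpoly) as coefficient lists indexed by the exponent of x.

    q -- [q_0, ..., q_{r-1}], the coefficients of Eq. (4)
    a -- [a_0, ..., a_{r-1}], the first r terms of the sequence
    """
    r = len(q)
    qpoly = [Fraction(0)] * (r + 1)
    for j in range(r):
        qpoly[r - j] = Fraction(q[j])      # q_j multiplies x^{r-j}

    def a_at(k):
        # Convention of Theorem 6: a_k = 0 for k < 0.
        return Fraction(a[k]) if 0 <= k < r else Fraction(0)

    p = [a_at(n) - sum(Fraction(q[j]) * a_at(n - r + j) for j in range(r))
         for n in range(r)]
    return p, qpoly


def expand(p, qpoly, terms):
    """First `terms` coefficients s_n of p / (1 - q), from s = p + q * s."""
    s = []
    for n in range(terms):
        value = p[n] if n < len(p) else Fraction(0)
        for k in range(1, min(n, len(qpoly) - 1) + 1):
            value += qpoly[k] * s[n - k]
        s.append(value)
    return s


if __name__ == "__main__":
    p, qpoly = generating_function([1, 1], [0, 1])
    print(p, qpoly, expand(p, qpoly, 8))
```

For this input the sketch yields $p = x$ and $q = x + x^2$, i.e. the familiar generating function $x/(1 - x - x^2)$ of the Fibonacci numbers.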



4 Learning rational series in one variable 

In this section we prove that rational series over a one-letter alphabet, having rank $\le d$, are learnable in polynomial time when multiplicity queries are allowed. We provide an algorithm which is polynomial in $d$. The learning model we use is the exact learning model with multiplicity queries. In this model we consider a learner that is able to ask an oracle the value of the target series $S$ for a given string in $\{x\}^*$. Moreover, we suppose that the positive integer $d$, which is an upper bound on the rank of the target series, is known to the learner.

By Theorems 4, 5 and 6, if we know the exact rank $r$ of $S$, then we can compute the coefficients $q_0, \ldots, q_{r-1}$ of Eq. (4) to learn a linear representation of $S$, a linear recurrence relation satisfied by the sequence $(a_n)_{n \ge 0}$ and a generating function of $S$. To compute the coefficients $q_0, \ldots, q_{r-1}$, we can solve the system (6) of $r$ linear equations in the $r$ unknowns $q_i$. By Corollary 3 this linear system is compatible and has a single solution, so it can be solved by Gaussian elimination, whose complexity is polynomial in $r$.

If instead of $rank(S)$ we know a parameter $d$ such that $rank(S) \le d$, by Corollary 2 the finite matrix $H(S)|_{d \times d}$ has rank equal to $rank(S)$; thus we can calculate the exact rank $r$ of $H(S)$ by computing the rank of $H(S)|_{d \times d}$.

We describe now the procedure for exactly identifying $S$ with rank at most $d$ from multiplicity queries. Let $r$ be the rank of $H(S)|_{d \times d}$ and $E_r = \{\varepsilon, x, x^2, \ldots, x^{r-1}\}$.

Algorithm $Learn_{1var}(d)$

— Compute the coefficients $q_0, q_1, \ldots, q_{r-1}$ satisfying Eq. (6).

— Compute the linear representation $(\lambda, \mu, \gamma)$ of $S$ as in Theorem 4.

— Compute a linear recurrence relation for $S$ as in Theorem 5:

$$\forall n \ge 0, \qquad a_{r+n} = \sum_{j=0}^{r-1} q_j a_{n+j}. \qquad (11)$$

— Compute a generating function $p/(1 - q)$ of $S$ as in Theorem 6: $q = q_{r-1} x + q_{r-2} x^2 + \cdots + q_0 x^r$ and $p = \sum_{n=0}^{r-1} (a_n - \sum_{j=0}^{r-1} q_j a_{n-r+j}) x^n$.
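The first two steps of the procedure can be sketched as follows, assuming the multiplicity oracle is available as a Python callable `oracle(n)` returning $a_n = (S, x^n)$. The names `rref`, `learn_1var` and `evaluate` are hypothetical, and exact rational arithmetic is used so that the rank computation is not affected by rounding.

```python
from fractions import Fraction


def rref(matrix):
    """Gauss-Jordan elimination; returns (reduced rows, pivot columns)."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    pivots, r = [], 0
    for c in range(len(rows[0]) if rows else 0):
        p = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if p is None:
            continue
        rows[r], rows[p] = rows[p], rows[r]
        inv = 1 / rows[r][c]
        rows[r] = [x * inv for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == len(rows):
            break
    return rows, pivots


def learn_1var(oracle, d):
    """Return (r, q, (lam, mu, gamma)) for a series of rank at most d."""
    a = [Fraction(oracle(n)) for n in range(2 * d)]      # 2d queries
    hankel = [[a[i + j] for j in range(d)] for i in range(d)]
    r = len(rref(hankel)[1])                             # rank of the d x d block
    if r == 0:
        return 0, [], ([], [], [])
    # System (6): sum_i a_{i+n} q_i = a_{r+n} for n = 0..r-1.
    augmented = [[a[i + n] for i in range(r)] + [a[r + n]] for n in range(r)]
    reduced, _ = rref(augmented)
    q = [row[r] for row in reduced]
    # Representation of Theorem 4: shift rows, then the row of q_j's.
    mu = [[Fraction(int(j == i + 1)) for j in range(r)]
          for i in range(r - 1)] + [q[:]]
    lam = [Fraction(1)] + [Fraction(0)] * (r - 1)
    gamma = a[:r]
    return r, q, (lam, mu, gamma)


def evaluate(rep, t):
    """(S, x^t) = lam * mu(x)^t * gamma."""
    lam, mu, gamma = rep
    vec = lam[:]
    for _ in range(t):
        vec = [sum(vec[i] * mu[i][j] for i in range(len(vec)))
               for j in range(len(vec))]
    return sum(v * g for v, g in zip(vec, gamma))
```

A possible use: with an oracle for the Fibonacci numbers and the bound $d = 5$, `learn_1var` is expected to find rank 2 and $q_0 = q_1 = 1$, after which `evaluate` reproduces values that were never queried.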

5 Unary output 2-tape NFA 

A nondeterministic two-tape automaton $M$ (2-tape NFA) is a 7-tuple

$$M = (Q, A, A_2, E_1, E_2, I, F),$$

where

— $Q$ is a finite set of states;

— $A$ and $A_2$ are two disjoint finite alphabets called first-tape and second-tape alphabets, respectively;

— $E_1$ is the set of the transitions relative to the first tape, that is

$E_1 \subseteq \{(p, a, \varepsilon, q) \mid p, q \in Q, a \in A\}$;

— $E_2$ is the set of the transitions relative to the second tape, that is

$E_2 \subseteq \{(p, \varepsilon, b, q) \mid p, q \in Q, b \in A_2\}$;

— $I$ and $F$ are the sets of the initial and final states, respectively.

A path $(p, x, y, q)$, labelled by $(x, y) \in A^* \times A_2^*$, with $xy \ne \varepsilon$, is a sequence of transitions $(p_1, x_1, y_1, p_2)(p_2, x_2, y_2, p_3) \cdots (p_n, x_n, y_n, p_{n+1})$, where $x = x_1 x_2 \ldots x_n \in A^*$, $y = y_1 y_2 \ldots y_n \in A_2^*$, $p_1 = p$, $p_{n+1} = q$, and $(p_j, x_j, y_j, p_{j+1}) \in E_1 \cup E_2$, for $j = 1, 2, \ldots, n$. We say that $(p, x, y, q)$ is an accepting path if $p \in I$ and $q \in F$.

A unary output two-tape nondeterministic finite automaton (unary output 2-tape NFA) is a particular 2-tape NFA with a unary second-tape alphabet, that is $A_2 = \{x\}$.

Definition 8. The behavior of a unary output 2-tape NFA $M$ is the map

$$S_M : (A^* \times \{x\}^*) - \{(\varepsilon, \varepsilon)\} \to \mathbb{Q}$$

such that, for any $(w, x^n) \in (A^* \times \{x\}^*) - \{(\varepsilon, \varepsilon)\}$, $(S_M, (w, x^n))$ is the number of the accepting paths labelled by $(w, x^n)$.

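Now we show that $S_M$ can be described by a map $S \in (\mathbb{Q}^{rat}\langle\langle x\rangle\rangle)^{rat}\langle\langle A\rangle\rangle$, i.e. $S$ is a recognizable series over the alphabet $A$, with coefficients in $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$. In fact we set

$$((S, w), x^n) = (S_M, (w, x^n)) \qquad \forall w \in A^*, \ \forall n \ge 0. \qquad (12)$$

Assume now that $S : A^* \to \mathbb{Q}\langle\langle x\rangle\rangle$ is a map associated with a unary 2-tape NFA $M$ according to Eq. (12). We will show that $S$ is a map from $A^*$ to $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ (i.e. a $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$-subset of $A^*$) and prove the existence of a linear representation $(\lambda, \mu, \gamma)$ of $S$. Let $M = (\{q_1, q_2, \ldots, q_n\}, A, \{x\}, E_1, E_2, I, F)$ and let $B = A \cup \{x\}$. We define the morphism $\mu' : B^* \to \mathbb{Q}^{n \times n}$ as follows: for $i, j = 1, 2, \ldots, n$, $a \in A$

$$\mu'(a)_{i,j} = 1 \text{ if } (q_i, a, \varepsilon, q_j) \in E_1, \quad \mu'(a)_{i,j} = 0 \text{ otherwise}; \qquad (13)$$

$$\mu'(x)_{i,j} = 1 \text{ if } (q_i, \varepsilon, x, q_j) \in E_2, \quad \mu'(x)_{i,j} = 0 \text{ otherwise}. \qquad (14)$$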

For $i, j = 1, 2, \ldots, n$ we denote by $S_{i,j}$ the (recognizable) series over the alphabet $\{x\}$ defined as follows:

$$S_{i,j} = \sum_{n \ge 0} \mu'(x^n)_{i,j}\, x^n. \qquad (15)$$

For any $n \ge 0$, $(S_{i,j}, x^n) = \mu'(x^n)_{i,j}$ is the number of the paths labelled by $x^n$ on the second tape from the state $q_i$ to the state $q_j$, that is the number of the paths like $(q_i, \varepsilon, x^n, q_j)$ (Eq. (14)).

We define a linear representation $(\lambda, \mu, \gamma)$ of $S$ as follows: for $i, j = 1, 2, \ldots, n$ and $a \in A$

— $\lambda_i = 1$ if $q_i \in I$, $\lambda_i = 0$ otherwise;

— $\gamma_i = 1$ if $q_i \in F$, $\gamma_i = 0$ otherwise;

— $\mu(a)_{i,j} = \sum_{h,k=1}^{n} S_{i,h}\, \mu'(a)_{h,k}\, S_{k,j} \in \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$. By Eq. (13), we can note that for any $w \in A^*$, $\mu'(w)_{i,j}$ is the number of the paths labelled by $w$ on the first tape from the state $q_i$ to the state $q_j$, that is the number of the paths like $(q_i, w, \varepsilon, q_j)$. By the definition of $\mu(a)_{i,j}$ one easily derives that $(\mu(a)_{i,j}, x^n)$ is the number of the paths of the kind $(q_i, a, x^n, q_j)$. In fact a path labelled by $(a, x^n)$ from the state $q_i$ to the state $q_j$ is of the kind $(q_i, \varepsilon, x^m, q_h)(q_h, a, \varepsilon, q_k)(q_k, \varepsilon, x^{n-m}, q_j)$, with $q_h, q_k \in Q$ and $m \le n$.

For any two states $q_h, q_k$ of $Q$, and for $m = 0, 1, \ldots, n$ we obtain all the paths labelled by $(a, x^n)$ from the state $q_i$ to the state $q_j$ and passing from $q_h$ to $q_k$ reading the letter $a$ on the first tape. The number of these paths is

$$\sum_{m=0}^{n} (S_{i,h}, x^m)\, \mu'(a)_{h,k}\, (S_{k,j}, x^{n-m}).$$

Letting $q_h$ and $q_k$ range over $Q$, we obtain all the paths labelled by $(a, x^n)$ from the state $q_i$ to the state $q_j$; thus the number of the paths like $(q_i, a, x^n, q_j)$ is

$$\sum_{h,k=1}^{n} \sum_{m=0}^{n} (S_{i,h}, x^m)\, \mu'(a)_{h,k}\, (S_{k,j}, x^{n-m}) = (\mu(a)_{i,j}, x^n).$$

We conclude that

$$\mu(a)_{i,j} = \sum_{n \ge 0} \Big( \sum_{h,k=1}^{n} \sum_{m=0}^{n} (S_{i,h}, x^m)\, \mu'(a)_{h,k}\, (S_{k,j}, x^{n-m}) \Big) x^n = \sum_{h,k=1}^{n} S_{i,h}\, \mu'(a)_{h,k}\, S_{k,j}.$$

Similarly one can prove that for any $w \in A^*$, $(\mu(w)_{i,j}, x^n)$ is the number of the paths of the kind $(q_i, w, x^n, q_j)$. Therefore $(\lambda, \mu, \gamma)$ is a linear representation of $S$. We conclude with the following important remark:

Remark 1. A rational series $S \in \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ has a generating function $p(x)/(1 - q(x))$, where $p$ and $q$ are polynomials and $q(\varepsilon) = 0$. Thus $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle \subseteq \mathbb{Q}(x)$, i.e. we can embed the ring $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ into the field of rational functions $\mathbb{Q}(x)$. From this point of view we can consider the series with coefficients in $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ as having their coefficients in the field $\mathbb{Q}(x)$.
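The path-counting formula for $(\mu(a)_{i,j}, x^n)$ derived in this section can be compared with a brute-force enumeration on a small automaton. The sketch below represents $\mu'(x)$ and $\mu'(a)$ as 0/1 matrices; the two-state example and all function names are assumptions made for illustration.

```python
def x_powers(mu_x, max_n):
    """powers[n][i][j] = mu'(x^n)_{ij} = (S_{ij}, x^n) for n = 0..max_n."""
    size = len(mu_x)
    powers = [[[int(i == j) for j in range(size)] for i in range(size)]]
    for _ in range(max_n):
        prev = powers[-1]
        powers.append([[sum(prev[i][k] * mu_x[k][j] for k in range(size))
                        for j in range(size)] for i in range(size)])
    return powers


def mu_coefficient(powers, mu_a, i, j, n):
    """(mu(a)_{ij}, x^n) by the convolution over h, k and m from the text."""
    size = len(mu_a)
    return sum(powers[m][i][h] * mu_a[h][k] * powers[n - m][k][j]
               for h in range(size) for k in range(size)
               for m in range(n + 1))


def count_paths(mu_x, mu_a, i, j, n):
    """Brute-force number of paths from q_i to q_j labelled (a, x^n)."""
    size = len(mu_a)

    def go(state, read_a, xs):
        if read_a and xs == n:
            return int(state == j)        # nothing left to read
        total = 0
        if not read_a:                    # take the single first-tape move
            total += sum(mu_a[state][nxt] * go(nxt, True, xs)
                         for nxt in range(size))
        if xs < n:                        # take one more second-tape move
            total += sum(mu_x[state][nxt] * go(nxt, read_a, xs + 1)
                         for nxt in range(size))
        return total

    return go(i, False, 0)


# Hypothetical automaton with states 0 and 1 (0-based indices).
MU_X = [[1, 1], [0, 1]]   # x-moves: 0->0, 0->1, 1->1
MU_A = [[0, 1], [0, 1]]   # a-moves: 0->1, 1->1
```

On this example both computations agree for every pair of states and every small $n$, as the derivation predicts.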

6 Learning unary output 2-tape NFA 

We prove that unary output 2-tape NFAs are exactly identifiable, in polynomial time, when multiplicity and equivalence queries are allowed. We consider a learner that is able to ask an oracle whether a guess is correct, obtaining a counterexample if it is not (equivalence queries), or to ask for the number of accepting paths of a pair of strings in $A^* \times \{x\}^*$ (multiplicity queries). If $S_M$ is the behavior of the target automaton, then we can obtain the answer to a query like

$(S_M, (w, x^n)) = ?$, for $(w, x^n) \in A^* \times \{x\}^*$,

that is equivalent, by Eq. (12), to a query like:

$((S, w), x^n) = ?$, for $(w, x^n) \in A^* \times \{x\}^*$.

Based on a previous work by Bergadano and Varricchio on automata with multiplicity in a field $K$, we show that the behaviors of unary output 2-tape NFAs may be identified in polynomial time when multiplicity and equivalence queries are allowed.

6.1 Base Algorithm 

Let $S \in (\mathbb{Q}^{rat}\langle\langle x\rangle\rangle)^{rat}\langle\langle A\rangle\rangle$ be the behavior of the target unary output 2-tape NFA $M$.

Definition 9. By $SubDim_S(\lambda, \mu, \gamma)$ we mean the greatest rank of the entries of the linear representation of $S$, i.e. the greatest rank of the elements $\lambda_i$, $\gamma_j$, $\mu(a)_{i,j}$, for $i, j = 1, \ldots, n$ and $a \in A$.

In this section we assume that the rank $n$ of $S$ and the value $SubDim_S(\lambda, \mu, \gamma) = m$ of the exact linear representation of $S$ are known. One can prove the following:

Proposition 1. Let $S \in (\mathbb{Q}^{rat}\langle\langle x\rangle\rangle)^{rat}\langle\langle A\rangle\rangle$ be a series with rank $n$ and let $(\lambda, \mu, \gamma)$ be a linear representation of $S$ of dimension $n$ where $\mu(a) \in (\mathbb{Q}^{rat}\langle\langle x\rangle\rangle)^{n \times n}$ for any $a \in A$, $\lambda \in (\mathbb{Q}^{rat}\langle\langle x\rangle\rangle)^{1 \times n}$ and $\gamma \in (\mathbb{Q}^{rat}\langle\langle x\rangle\rangle)^{n \times 1}$. For any $w \in A^*$, if $SubDim_S(\lambda, \mu, \gamma) = m$ and $|w| = h$, then

rank((S, w)) ≤ ((h − 2)mn^ + (2m + h − 1)n + 2m + 2)n^ + 1,

that is rank((S, w)) = O(hmn^).

In the sequel we will denote this estimate of the maximal rank of the entries of $(S, w)$ by $dim_w(n, m)$.

Theorem 7. Let $S \in (\mathbb{Q}^{rat}\langle\langle x\rangle\rangle)^{rat}\langle\langle A\rangle\rangle$ be a target series. If the parameters $n$ and $m$ are known, then for any $w \in A^*$, the series $S^{(w)} = (S, w) \in \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ is exactly identifiable, when multiplicity queries are allowed, with an algorithm which is polynomial in $n$, $m$ and $h = |w|$.

Proof. Given the parameters $n$ and $m$, we know that the rational series $S^{(w)} = (S, w) \in \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ have rank at most

dim_w(n, m) = ((h − 2)mn^ + (2m + h − 1)n + 2m + 3)n^ + 1.

The algorithm $Learn_{1var}(d)$ of Sec. 4 identifies, using multiplicity queries, a rational series $S' \in \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ with rank at most $d$ in polynomial time with respect to $d$. Since, for any $w \in A^*$, the series $S^{(w)} = (S, w) \in \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ has rank at most $dim_w(n, m)$, we may identify $S^{(w)}$ with the learning algorithm $Learn_{1var}(dim_w(n, m))$. This algorithm is polynomial in $dim_w(n, m)$, that is in $n$, $m$ and $h = |w|$. □

Let $S \in (\mathbb{Q}^{rat}\langle\langle x\rangle\rangle)^{rat}\langle\langle A\rangle\rangle$. Let $n = rank(S)$ and $m = SubDim_S(\lambda, \mu, \gamma)$; for any $w \in A^*$, we may describe the algorithm, polynomial in $n$, $m$ and $h = |w|$, for exactly identifying the series $S^{(w)} = (S, w) \in \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ when multiplicity queries are allowed:

Algorithm $Learn'_{1var}(w, n, m)$

— From the bound of Proposition 1 compute $d = dim_w(n, m)$;

— call $Learn_{1var}(d)$

We can consider the algorithm $Learn'_{1var}(w, n, m)$ as a multiplicity oracle for a target $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$-set $S$, if the parameters $n$ and $m$ are known. Moreover, by Remark 1 the ring $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ is embedded in the field of rational functions in one variable. Thus, if the parameters $n$ and $m$ of the target $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$-set of $A^*$ are known, then we can apply the learning algorithm of Bergadano and Varricchio.

Let $S \in (\mathbb{Q}^{rat}\langle\langle x\rangle\rangle)^{rat}\langle\langle A\rangle\rangle$.

Definition 10. An observation table for $S$ is a triplet $t = (P, E, T)$, where $P \subseteq A^*$ is a prefix-closed set of strings, $E \subseteq A^*$ is a suffix-closed set of strings and $T : (P \cup PA)E \to \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ is a map that gives the observed values of $S$, that is $T(w) = Learn'_{1var}(w, n, m)$ for any $w \in (P \cup PA)E$.

The set $P$ determines a set of rational series $\{S_u \mid u \in P\}$ that will be useful to define the target series $S$ via linear dependencies.

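Definition 11. An observation table $(P, E, T)$ is closed iff, for any $u \in P$ and $a \in A$, there is a series $\beta_v \in \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$, for each $v \in P$, such that

$$S_{ua} =_E \sum_{v \in P} \beta_v S_v. \qquad (16)$$

Definition 12. An observation table $(P, E, T)$ is consistent iff, for any choice of the rational series $\beta_v \in \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$, for $v \in P$, and for any $a \in A$,

$$\sum_{v \in P} \beta_v S_v =_E 0 \implies \sum_{v \in P} \beta_v S_{va} =_E 0. \qquad (17)$$

Definition 13. $P$ is a complete set of strings for $S$ iff for any $u \in P$ and $a \in A$, there is a series $\lambda_v \in \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$, for each $v \in P$, such that

$$S_{ua} = \sum_{v \in P} \lambda_v S_v. \qquad (18)$$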

Here we only want to show how from such a table $(P, E, T)$ we can guess a $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$-set $M(P, E, T)$ by basing its representation upon the existing linear dependencies:

— Let $P = \{u_1, \ldots, u_k\}$, with $u_1 = \varepsilon$.

— For all $a \in A$, compute $\hat\mu(a)$ satisfying

$$S_{u_i a} =_E \sum_{j} \hat\mu(a)_{u_i, u_j} S_{u_j}.$$

Such a matrix exists because the table is closed.

— Let $\lambda = (1, 0, \ldots, 0)$ and $\gamma = ((S_{u_1}, \varepsilon), (S_{u_2}, \varepsilon), \ldots, (S_{u_k}, \varepsilon))$. The value of $(S_{u_j}, \varepsilon)$ is found in the table since $u_j \in P$ and $\varepsilon \in E$. Obviously $\hat\mu(a)_{u_i, u_j}$ is the value at row $i$ and column $j$ of the matrix $\hat\mu(a)$. Let $\hat\mu(a_1 a_2 \ldots a_r) = \hat\mu(a_1) \hat\mu(a_2) \cdots \hat\mu(a_r)$, $a_i \in A$. Define the $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$-set $M$ by $(M, w) = \lambda \hat\mu(w) \gamma$.

We may now describe an adaptation of the algorithm given by Bergadano and Varricchio for exactly identifying $S$ from multiplicity and equivalence queries, if we know the rank $n$ of the target series $S$ and the subdimension $m$ of a linear representation of $S$.

Base algorithm:

$T \leftarrow (\{\varepsilon\}, \{\varepsilon\}, T)$, where $(T, \varepsilon) = Learn'_{1var}(\varepsilon, n, m)$.

Repeat

— Make the table closed and consistent ($P$ and $E$ are extended and the entries of $T$ are filled in by algorithm $Learn'_{1var}$). We remark that $Learn'_{1var}$ returns the correct entries iff the parameters $n$ and $m$ are correct.

— Make the hypothesized $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$-set $M(P, E, T)$.

— Ask for a counterexample $t$ to $M(P, E, T)$ by means of an equivalence query.

— Add $t$ and its prefixes to $P$.

until correct

Bergadano and Varricchio showed that this algorithm is correct and that, if $rank(S) = n$, then after at most $n$ equivalence queries we will have a correct guess, i.e. $M(P, E, T) = S$. Hence after at most $n$ iterations the algorithm stops.

6.2 Closing a table

Given a table $(P, E, T)$ and $u \in P$, we suppose $S_{ua}$ is linearly independent from $\{S_v \mid v \in P\}$ with respect to $E$, in the sense that there are no $\beta_{u,v} \in \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ such that $S_{ua} =_E \sum_{v \in P} \beta_{u,v} S_v$. In this case $ua$ is added to $P$, and the table is again checked for closure.

This procedure must terminate. More precisely, if the correct $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$-set $S$ is representable with $(S, x) = \lambda \mu(x) \gamma$, where $\lambda, \gamma \in (\mathbb{Q}^{rat}\langle\langle x\rangle\rangle)^n$ and $\mu : A^* \to (\mathbb{Q}^{rat}\langle\langle x\rangle\rangle)^{n \times n}$ is a morphism, then at most $n$ strings can be added to $P$ when closing the table. In fact, it should be noted that, when $ua$ is added to $P$ as indicated above, the dimension of $\{\lambda \mu(v) \mid v \in P\}$, as a subset of the vector space $(\mathbb{Q}^{rat}\langle\langle x\rangle\rangle)^n$, is increased by one. Otherwise, $\lambda \mu(ua)$ would be equal to $\sum_{v \in P} \beta_v \lambda \mu(v)$ for some $\beta_v \in \mathbb{Q}^{rat}\langle\langle x\rangle\rangle$ and

$$(S_{ua}, x) = (S, uax) = \lambda \mu(ua) \mu(x) \gamma = \sum_{v \in P} \beta_v \lambda \mu(v) \mu(x) \gamma = \sum_{v \in P} \beta_v (S_v, x),$$

i.e., $S_{ua}$ would depend linearly on $\{S_v \mid v \in P\}$. Since the dimension of $\{\lambda \mu(v) \mid v \in P\}$ is at most $n$, we cannot close the table more than $n$ times. The above discussion does not depend on $E$.



6.3 Making tables consistent 

Given a table $(P, E, T)$ and a symbol $a \in A$, consider the two systems of linear equations:

$$(1) \ \sum_{v \in P} \beta_v S_v =_E 0 \qquad\qquad (2) \ \sum_{v \in P} \beta_v S_{va} =_E 0,$$

with the $\beta_v$ as unknowns. Check whether every solution of system (1) is also a solution of system (2). In this case the table is consistent. Otherwise, let $\beta_v$, $v \in P$, be a solution of (1) that is not a solution of (2) and let $x \in E$ be such that $\sum_{v \in P} \beta_v (S_{va}, x) \ne 0$. Add $ax$ to $E$.

We suppose that $S$ has a linear representation $(\lambda, \mu, \gamma)$ of dimension $n$; there cannot be more than $n$ such additions to $E$, because every time a new string $ax$ is added, the dimension of $\{\mu(w) \gamma \mid w \in E\}$ is increased by one. In fact, if $\mu(ax) \gamma = \sum_{w \in E} \delta_w \mu(w) \gamma$, then

$$\sum_{v \in P} \beta_v (S_{va}, x) = \sum_{v \in P} \beta_v \lambda \mu(v) \mu(ax) \gamma = \sum_{v \in P} \beta_v \lambda \mu(v) \sum_{w \in E} \delta_w \mu(w) \gamma = \sum_{w \in E} \delta_w \sum_{v \in P} \beta_v \lambda \mu(vw) \gamma = \sum_{w \in E} \delta_w \sum_{v \in P} \beta_v (S_v, w) = 0,$$

i.e., $ax$ would not have been added to $E$.

6.4 Extended algorithm

The algorithm is correct iff we know the exact rank of the target series $S$ and the exact subdimension $m$ of the correct linear representation for $S$. Otherwise the algorithm $Learn'_{1var}(w, n, m)$ may fail. In this case the stage in which the learning process builds a closed and consistent table may not terminate; moreover the base algorithm itself may not terminate. But we know that if the rank of the target series is $n$, then

— we cannot close the table more than $n$ times (cf. Sec. 6.2);

— we cannot add more than $n$ strings to $E$ (cf. Sec. 6.3);

— after at most $n$ iterations the algorithm stops (cf. Bergadano and Varricchio).

Thus if we suppose that $rank(S) = n$ and $SubDim_S(\lambda, \mu, \gamma) = m$, we may conclude that these values are wrong if the algorithm closes the table more than $n$ times, or adds more than $n$ strings to $E$, or does not stop after $n$ iterations. In this case we may increase $n$ and $m$ and execute the base algorithm again, supposing that the new values are correct.

Extended algorithm:

n := 0; m := 0; Error := true;

Repeat

— If Error then

• n := n + 1; m := m + 1;

• $T \leftarrow (\{\varepsilon\}, \{\varepsilon\}, T)$, where $(T, \varepsilon) = Learn'_{1var}(\varepsilon, n, m)$;

• Error := false; Control := 0; OldP := 0;

— while not (T closed and consistent) and Control ≤ n and |E| ≤ n do

• Make the table closed;

• If |P| ≠ OldP then Control := Control + 1; OldP := |P|;

• Make the table consistent.

— If Control ≤ n and |E| ≤ n

• then

* Make the hypothesized $\mathbb{Q}^{rat}\langle\langle x\rangle\rangle$-set $M(P, E, T)$.

* Ask for a counterexample $t$ to $M(P, E, T)$ by means of an equivalence query.

* Add $t$ and its prefixes to $P$; Control := Control + 1;

• else Error := true;

until correct
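The restart logic of the extended algorithm can be summarized by the following control-flow sketch. It is a simplification under stated assumptions: `run_base(n, m)` is a hypothetical callable standing in for one complete run of the base algorithm under the budgets above, and the cap `max_guess` is added here only so that the sketch always terminates.

```python
def extended_learning(run_base, max_guess=1000):
    """Increase the guesses n and m together until the base run succeeds.

    run_base(n, m) -- hypothetical stand-in for the base algorithm run with
                      guessed parameters n and m; it returns a hypothesis, or
                      None when a budget is exceeded (the Error := true case).
    max_guess      -- safety cap that is not part of the paper's algorithm.
    """
    for guess in range(1, max_guess + 1):
        hypothesis = run_base(guess, guess)   # n := n + 1; m := m + 1
        if hypothesis is not None:
            return guess, hypothesis
    raise RuntimeError("no hypothesis found within max_guess rounds")
```

Each failed round restarts from the initial table, mirroring the reinitialization of $T$ whenever Error is set.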

The correctness and the termination of the base algorithm, shown by Bergadano and Varricchio, imply the correctness and the termination of the extended algorithm.


Abstract. We address the problem of nonadaptive learning of Boolean functions with few relevant variables by membership queries. In another recent paper [3] we have characterized those assignment families (query sets) which are sufficient for nonadaptive learning of this function class, and we studied the query number. However, the reconstruction of the given Boolean function from the obtained responses is an important matter as well in applying such nonadaptive strategies. The computational cost of this is apparently too high if we use our query families in a straightforward way. Therefore we introduce algorithms for which the computational complexity, and not only the query number, is reasonable. The idea is to apply our assignment families to certain coarsenings of the given Boolean function, followed by simple search and verification routines.


1 Introduction and Problem Statement 

Attribute-efficient learning means the learning of Boolean functions $f$ where only an unknown small subset $R \subseteq V$ of the variable set $V$ is relevant. A variable $v \in V$ is called relevant if there exists an assignment of the remaining variables in $V \setminus \{v\}$ such that the function value of $f$ changes if we switch the value of $v$ only. In simpler words, $v$ is relevant if it has an actual influence on $f$. Let $Rel(n, r)$ denote the class of Boolean functions of $n$ variables, $r$ or fewer of which are relevant.
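The definition of a relevant variable can be illustrated by a brute-force check. The sketch below is not one of the learning algorithms discussed in this paper: it queries all $2^n$ assignments, so it is only meant to make the definition concrete for very small $n$. The function name and the sample function are assumptions.

```python
from itertools import product


def relevant_variables(f, n):
    """Indices v such that flipping v alone changes f on some assignment."""
    relevant = set()
    for assignment in product((0, 1), repeat=n):
        for v in range(n):
            if v in relevant:
                continue
            flipped = list(assignment)
            flipped[v] ^= 1                    # switch the value of v only
            if f(assignment) != f(tuple(flipped)):
                relevant.add(v)
    return relevant


if __name__ == "__main__":
    # Sample function of 5 variables depending only on variables 0, 2 and 3.
    print(relevant_variables(lambda a: a[0] ^ (a[2] & a[3]), 5))
```

In the notation of the text, the sample function belongs to $Rel(5, 3)$.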

We consider the model of exact learning by membership queries, that is, 
we may choose arbitrary assignments and ask an oracle about the value of $f$
there. Our goal is to identify $f$. In parallel learning, the learning process consists
of a sequence of stages. In every stage we may fix a set of queries which are 
asked simultaneously. The query set chosen in any stage may depend on the 
responses obtained in earlier stages. In the setting of adaptive learning we allow 
only one query per stage. The other extreme is nonadaptive learning where only 
one stage is allowed and all queries must be fixed beforehand. (The terminology 
in the literature may differ, cf. 0. However, in the present paper let us use the 
terms as introduced above.) 

Another view of nonadaptive learning is the concept of teaching [HJ |TT?) | il 4| | . Assignment families which distinguish pairwise between all functions from a given class are called universal identification sequences or universal teaching sets there, and several classes with polynomial-size universal teaching sets are known.

In the following, an assignment family means any subset of the $2^n$ possible
assignments. In [7] we give a graph-theoretic characterization of assignment
families $A$ that are sufficient for nonadaptive learning of functions $f \in \mathrm{Rel}(n, r)$,
called r-wise bipartite connected families (see definitions below). Notice that, for
trivial reasons, deciding whether $f \in \mathrm{Rel}(n, r)$ would require asking all $2^n$
possible queries in the worst case. (Consider e.g. a constant function vs. a function
whose value deviates on exactly one assignment; then we have $r = 0$ and $r = n$,
respectively.) So we must be sure in advance that $f \in \mathrm{Rel}(n, r)$, but this is a
reasonable assumption in view of the interesting applications of attribute-efficient
learning, such as fault detection, diagnosis systems, and combinatorial search 
p II icn] uBi Furthermore, parallelity of queries is essential in applications 
where the tests (queries) can really be performed simultaneously, but each test
is time-consuming. This is the case e.g. in pooling in experimental molecular 
biology. We refer to the mentioned papers for background information. Other 
problems in the field of attribute-efficient learning are studied e.g. in ESI 0 P’ 

Assume always that r is nothing more than a fixed small integer, but n 
may be huge. The problems are studied for general r only because the principal 
structures remain the same. 

In [7] we study assignment families $A$ that are eligible for nonadaptive learning
of functions from $\mathrm{Rel}(n, r)$. (That means, every function from this class can be
identified from the values $f(a)$, $a \in A$.) We prove the existence of such families of
size $O(r^2 2^r + r 2^r \log n)$. Actually, a random family of that size is sufficient with
high probability. We also proposed a pseudopolynomial explicit construction
that yields slightly larger families. On the other hand, $\Omega(2^r \log n)$ queries
are necessary even for adaptive learning. Hence the pure query complexity of
learning $\mathrm{Rel}(n, r)$ is quite well understood. Constructing good families $A$ needs
some effort, but this may be done once and for all, for given bounds on $n$ and $r$,
and the resulting $A$ may be permanently stored. So this point is not an obstacle.

However, there remains another serious problem that cannot be ignored: In
order to apply a nonadaptive learning algorithm $A$ to several instances $f$, it is not
enough to know that different functions $f$ from $\mathrm{Rel}(n, r)$ yield different response
vectors $[f(a)]_{a \in A}$. We must also be able to perform the inverse transformation,
i.e. to extract $f$ from the response vector.

It is implicit in the proof of our characterization in [7] that the set $R$ of relevant
variables is the unique minimum $(A, f)$-feasible set (see definitions below).
Once we know the set $R$, the function $f$ is also learned, since our families $A$ have
the particular property that they induce all possible assignments on subsets of
size $r$. (They are r-universal; see definitions below.) So we can focus attention
on identifying $R$.

Clearly, $(A, f)$-feasibility of any subset can be recognized in $O(n|A|)$ time by
lexicographic sorting. So we can find $R$ naively by checking all (roughly) $n^r/r!$
candidate sets for $(A, f)$-feasibility. This gives a total computational amount
of $O((r 2^r / r!)\, n^{r+1} \log n) = O(\sqrt{r}\,(2e/r)^r n^{r+1} \log n)$, which is barely practicable
because of the $n^{r+1}$ term. Unfortunately, we did not find a better algorithm
than exhaustive search, and even worse, we have the impression that such a
fast transformation might not exist in general. Loosely speaking, it seems that
the structure of our families $A$ alone does not give enough hints how to find $R$.
Trivially, all sets including $R$ are $(A, f)$-feasible, but the difficulty is that many
other sets are $(A, f)$-feasible, too, by inner dependencies in $A$.
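
The simplification of the exhaustive-search bound can be checked with Stirling's inequality $r! \ge \sqrt{2\pi r}\,(r/e)^r$. The following short derivation is a sketch of that step, under the assumption that $|A| = O(r 2^r \log n)$ (the dominant term of the family size when $\log n \ge r$):

\[
\frac{n^r}{r!} \cdot O(n|A|) = O\!\left(\frac{r\,2^r}{r!}\, n^{r+1} \log n\right),
\qquad
\frac{r\,2^r}{r!} \le \frac{r\,2^r}{\sqrt{2\pi r}\,(r/e)^r} = \sqrt{\frac{r}{2\pi}} \left(\frac{2e}{r}\right)^{r}.
\]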

Thus we should aim at such nonadaptive or parallel learning strategies where 
also the amount of auxiliary computations is reasonable. The above discussion 
does not imply that our r-wise bipartite connected families are of purely aca- 
demic interest. On the contrary, in our favourite parallel algorithm we shall 
essentially make use of them again. 

We remark that in the important special case of nonadaptive group testing 
P 0 0, the analogous problem of reconstructing the given function from the
test results is almost trivial. (Group testing is learning of the disjunction of an 
unknown subset of variables, called the “defectives”.) 

The present note is understood as a supplement to [7]. We propose several
solutions to our problem, based on the same idea. The choice of an algorithm for 
a concrete instance will depend on several circumstances such as the problem 
size, and the ratio of query costs and computation costs. (It may be assumed 
that queries are physical procedures, of whatever nature, outside a computer 
and therefore expensive, but computations are nowadays cheap and fast.) We do 
not provide novel techniques here, other than new compositions of the formerly 
known structures, but the issue is essential for making parallel attribute-efficient 
learning practically accessible. For convenience we formulate our results throughout
in O-notation, but the hidden factors are always moderate.

2 Special Assignment Families 

First we list some definitions of useful combinatorial structures, as well as the 
basic lemmas. As explained in the introduction, we need not worry about explicit 
constructions of these combinatorial objects here, and we suppose that they are 
already available for given $n$ and $r$. Throughout the paper, $f$ denotes the given
function from $\mathrm{Rel}(n, r)$ that we wish to learn, $V$ is the set of variables, and
$R \subseteq V$ is the set of relevant variables of $f$.

Definition 1. An assignment family $A$ is called r-universal (or r-exhaustive)
if each of the $2^r$ possible assignments on each subset of $r$ variables is induced
by some member of $A$. By convention, any nonempty assignment family is
0-universal.
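
Definition 1 can be checked directly for small instances. The sketch below is an illustration only: the representation of assignments as 0/1 tuples and the function name are assumptions, and the running time grows with the number of r-subsets.

```python
from itertools import combinations

def is_r_universal(family, n, r):
    """Check whether every r-subset of the n variables sees all 2^r patterns."""
    if r == 0:
        # Convention from Definition 1: any nonempty family is 0-universal.
        return len(family) > 0
    for subset in combinations(range(n), r):
        induced = {tuple(a[i] for i in subset) for a in family}
        if len(induced) < 2 ** r:
            return False
    return True
```

For instance, the two constant assignments on three variables form a 1-universal family that is not 2-universal.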

Lemma 1. There exist r-universal families of size $O(r 2^r \log n)$. □

This is proved in a straightforward way by the probabilistic method. Explicit 
constructions of slightly larger families are also known. See HH for these matters. 
In the following we take the liberty to apply the $O(r 2^r \log n)$ bound.

The following definition from [7] is not directly used in the sequel and may
be skipped by the reader, but we include it here in order to make this note more
self-contained.

Definition 2. An r-universal assignment family $A$ on $V$ is called r-wise bipartite
connected if each bipartite graph $B(X, Y, z)$ is connected, where $X, Y, Z$ are
mutually disjoint subsets of $V$ with $|X \cup Z| = |Y \cup Z| = r$, $z$ is an assignment
on $Z$, and $B(X, Y, z)$ is defined in the following way: The vertices are all possible
assignments $x$ on $X$ and $y$ on $Y$, respectively, and $xy$ is an edge iff some
assignment from $A$ induces $x$, $y$, and $z$ on $X$, $Y$, and $Z$, respectively.

Definition 3. Let $f$ be a Boolean function and $A$ an assignment family. The set
$U \subseteq V$ is called $(A, f)$-feasible if, for all $a \in A$, $f(a)$ depends merely on the
assignment induced by $a$ on $U$.
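
A feasibility test in the sense of Definition 3 only has to verify that no two members of $A$ agree on $U$ but receive different responses. The following sketch groups assignments with a dictionary instead of lexicographic sorting (an implementation choice made here, not the method named in the text); all names are hypothetical.

```python
def is_feasible(U, family, responses):
    """U: variable indices; family: list of 0/1 tuples; responses[i] = f(family[i])."""
    U = sorted(U)
    seen = {}
    for a, value in zip(family, responses):
        key = tuple(a[i] for i in U)   # assignment induced by a on U
        if seen.setdefault(key, value) != value:
            return False               # same pattern on U, different response
    return True
```

With the full cube on three variables and f(a) = a[0] AND a[2], the set {0, 2} is feasible while {0} is not.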

The central result of [7] is:

Theorem 1. An assignment family $A$ can learn functions from $\mathrm{Rel}(n, r)$ nonadaptively
if and only if $A$ is r-wise bipartite connected. Moreover, there exist
such families of size $O(r^2 2^r + r 2^r \log n)$.¹ The set $R$ of relevant variables of any
$f \in \mathrm{Rel}(n, r)$ is exactly the unique minimum $(A, f)$-feasible set. □

From Theorem 1 and the remark in the introduction we get:

Theorem 2. Functions from $\mathrm{Rel}(n, r)$ can be learned nonadaptively by
$O(r^2 2^r + r 2^r \log n)$ queries followed by $O(\sqrt{r}\,(2e/r)^r n^{r+1} \log n)$ computations. □

The next lemma shows how to test, by nonadaptive queries to $f$, whether a
given set $S \subseteq R$ even satisfies $S = R$.

Lemma 2. Let $S \subseteq R$, $s = |S|$, and let $A$ consist of all pairs of arbitrary
assignments on $S$ and assignments from an $(r - s)$-universal family on $V \setminus S$,
respectively. Then there exist relevant variables outside $S$ (i.e. $R \setminus S \ne \emptyset$) if and
only if $S$ is not $(A, f)$-feasible. The size of $A$ can be bounded by $O(r 2^r \log n)$.

Proof. The “if” direction is trivial, so we prove “only if”.

Assume that $V \setminus S$ contains relevant variables. Since $|R| \le r$ and $S \subseteq R$,
these are at most $r - s$ variables. For any $v \in R \setminus S$, there exists an assignment
on $R$ such that $f$ changes if $v$ changes and the values on $R \setminus \{v\}$ remain fixed.
Hence, due to $(r - s)$-universality, there exist two assignments $a, a' \in A$ agreeing
on $S$ but giving $f(a) \ne f(a')$. That means, $S$ is not $(A, f)$-feasible.

If we take an $(r - s)$-universal family as in Lemma 1 then the number of
assignments in $A$ is $O(2^s (r - s) 2^{r-s} \log n)$ which implies the asserted bound. □
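
The test of Lemma 2 can be sketched as follows. This is a minimal illustration under assumptions: variables are indexed 0..n-1, the (r - s)-universal family on the variables outside S is passed in as `outer_family` (tuples ordered by increasing variable index), and f is evaluated directly instead of through an oracle.

```python
from itertools import product

def paired_family(n, S, outer_family):
    """All pairs of an arbitrary assignment on S and a member of outer_family."""
    S = sorted(S)
    rest = [v for v in range(n) if v not in S]
    family = []
    for inner in product((0, 1), repeat=len(S)):
        for outer in outer_family:
            a = [0] * n
            for v, bit in zip(S, inner):
                a[v] = bit
            for v, bit in zip(rest, outer):
                a[v] = bit
            family.append(tuple(a))
    return family

def has_relevant_outside(f, n, S, outer_family):
    """True iff S is not (A, f)-feasible for the paired family A."""
    S = sorted(S)
    seen = {}
    for a in paired_family(n, S, outer_family):
        key = tuple(a[v] for v in S)
        if seen.setdefault(key, f(a)) != f(a):
            return True
    return False
```

For f(a) = a[0] AND a[2] on four variables (so r = 2), the call with S = {0} and the 1-universal family [(0, 0, 0), (1, 1, 1)] reports a missing relevant variable, whereas S = {0, 2} with any nonempty family passes.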

Definition 4. We say that a partition of $V$ into subsets, called bins, separates
a subset $R \subseteq V$ if the elements of $R$ get into pairwise distinct bins. An $(r, b)$-separating
family is a family of partitions of $V$ with $b \ge r$ bins each, such that
every r-element subset of $V$ is separated by at least one of these partitions.

¹ In the preliminary version of [7] we claimed an $O(r 2^r \log n)$ bound, but at the moment
we can only prove the slightly weaker bound as stated here. However, this is marginal.

Our separating families lie somewhere between shattering families (as in VC
theory), splitters and perfect hash functions (see e.g. M) ; here we prefer the 
term “separating families” to avoid confusions with these similar concepts. Note 
that the bins are not required to be of equal size here. 

Lemma 3. There exist $(r, r^2)$-separating families of size $O(r \log n)$.

Proof. This is a routine application of the probabilistic method. Each element
of $V$ is thrown independently and equiprobably into one of the $b$ bins. Consider
a fixed $R \subseteq V$ of size $r$. The probability of $R$ to be separated is at least
$(1 - r/b)^r \approx e^{-r^2/b}$. Hence the probability that some of the r-element subsets
remains unseparated by $t$ random partitions is less than $n^r (1 - e^{-r^2/b})^t$. Choosing
$t = O(e^{r^2/b}\, r \log n)$ keeps this probability below 1. Finally let $b = r^2$. □
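
The probabilistic argument suggests a simple randomized construction for small instances. The sketch below keeps drawing random partitions until every r-element subset is separated; it enumerates all r-subsets explicitly, so it only illustrates the idea and is not the construction assumed in the text. It assumes r >= 1, and all function names are hypothetical.

```python
import random
from itertools import combinations

def random_partition(n, b, rng):
    """Throw each of the n variables independently into one of b bins."""
    return [rng.randrange(b) for _ in range(n)]

def separates(partition, subset):
    """True iff the elements of subset land in pairwise distinct bins."""
    bins = [partition[v] for v in subset]
    return len(set(bins)) == len(bins)

def build_separating_family(n, r, rng, b=None):
    """Collect random partitions until every r-subset is separated by one of them."""
    b = r * r if b is None else b
    unseparated = set(combinations(range(n), r))
    family = []
    while unseparated:
        p = random_partition(n, b, rng)
        newly = {s for s in unseparated if separates(p, s)}
        if newly:
            family.append(p)
            unseparated -= newly
    return family
```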

Definition 5. For a Boolean function $f$ and a partition $\pi$ of $V$, the coarsening
$f_\pi$ is the function whose variables are the bins $y_1, \ldots, y_b$ of $\pi$, such that
$f_\pi(y_1, \ldots, y_b)$ is defined to be the value of $f$ when we assign the value of $y_i$ to all
variables in the i-th bin, for $i = 1, \ldots, b$. Bins that are relevant variables with
respect to $f_\pi$ are referred to as the relevant bins.

A projection of f is any function obtained from f by fixing the assignment 
on a subset of the variables. 
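
A coarsening is easy to express as a wrapper around $f$. In this sketch a partition is represented as a list mapping each variable to its bin index, which is an assumption made for illustration.

```python
def coarsen(f, partition):
    """Return f_pi: it takes one value per bin and copies it to every variable in the bin."""
    def f_pi(bin_values):
        return f(tuple(bin_values[partition[v]] for v in range(len(partition))))
    return f_pi

# Hypothetical example: n = 6, relevant variables 1 and 4 fall into bins 1 and 0.
f = lambda a: a[1] & a[4]
f_pi = coarsen(f, [0, 1, 1, 2, 0, 2])
print(f_pi((1, 1, 0)))
```

Here bins 0 and 1 are the relevant bins of the coarsening, and the partition separates the set of relevant variables.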

We need a further, rather trivial lemma as a basic step. 

Lemma 4. Functions $f \in \mathrm{Rel}(n, 1)$ can be learned nonadaptively by $O(\log n)$
queries and $O(n \log n)$ computations. □

This is immediately clear, but it should also be noticed that the obvious 
strategy is not good for finding one out of several relevant variables, i.e. it may 
fail if r > 1. 
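
One natural reading of the "obvious strategy" is a binary-code search: besides the two constant assignments, query j gives every variable the j-th bit of its own index. The sketch below implements this reading; it is an assumption about the intended strategy, and it is only correct under the promise that f has at most one relevant variable.

```python
def one_variable_queries(n):
    """All-0, all-1, and one query per bit position of the variable index."""
    bits = max(1, (n - 1).bit_length())
    queries = [tuple([0] * n), tuple([1] * n)]
    for j in range(bits):
        queries.append(tuple((i >> j) & 1 for i in range(n)))
    return queries

def decode_one_variable(responses):
    """Return None if f is constant, else (index, positive) with f = x_index or its negation."""
    f0, f1 = responses[0], responses[1]
    if f0 == f1:
        return None
    index = 0
    for j, value in enumerate(responses[2:]):
        if value == f1:        # the relevant variable had bit j set to 1
            index |= 1 << j
    return index, f1 == 1

# Hypothetical usage with f = NOT x_6 on n = 10 variables.
f = lambda a: 1 - a[6]
answers = [f(q) for q in one_variable_queries(10)]
print(decode_one_variable(answers))
```

If two or more variables are relevant, the decoded index need not correspond to any relevant variable, which matches the remark that the strategy may fail for r > 1.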

3 Nonadaptive Attribute-Efficient Learning with Fair 
Total Complexity 

In the following results we implicitly presume that the necessary ingredients (i.e. 
special assignment families) are already available, and so the time to construct 
them is not being counted. As already mentioned, the construction must be done 
only once for given $n$ and $r$ and can be applied then to several $f \in \mathrm{Rel}(n, r)$.
Thus, henceforth our input is the function $f \in \mathrm{Rel}(n, r)$ to be learned, given as
an oracle.

Theorem 3. Functions from $\mathrm{Rel}(n, r)$ can be learned nonadaptively using
$O(r^4 2^r \log r \log^2 n)$ queries followed by $O((r^4 2^r \log r)\, n \log^2 n)$ computations.

Proof. The construction consists of three nested structures. 

(1) Take an $(r, b)$-separating family with $b = r^2$ of size $O(r \log n)$, as given
by Lemma 3.

(2) For every partition $\pi$, take an r-universal family on the set of bins of
$\pi$ as the ground set. By Lemma 1 the size is $O(r 2^r \log b) = O(r 2^r \log r)$. Now
we have $O(r^2 2^r \log r \log n)$ “bin assignments” which may be also considered as
assignments on $V$.

(3) Finally we replace every such assignment $a$ with $O(r^2 \log n)$ assignments
as follows: For every bin $y$, fix the assignment induced by $a$ on $V \setminus y$, and replace
the constant assignment (all 0 or all 1) on $y$ by the members of a nonadaptive
learning family for one relevant variable, as given by Lemma 4. This gives a total
of $O(r^4 2^r \log r \log^2 n)$ assignments.
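
As a quick consistency check of this count, the three nested structures multiply as follows (a sketch, with $b = r^2$ bins per partition and $O(\log n)$ queries per bin from Lemma 4):

\[
\underbrace{O(r \log n)}_{\text{partitions}} \cdot
\underbrace{O(r\,2^r \log r)}_{\text{bin assignments per partition}} \cdot
\underbrace{O(r^2 \log n)}_{\text{bins} \,\times\, \text{queries per bin}}
= O(r^4\,2^r \log r \log^2 n).
\]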

The learning algorithm works as follows: Query simultaneously all assignments
produced above. For each $\pi$ and each bin $y$ of $\pi$, consider all $O(\log n)$
assignments introduced in (3). Whenever the straightforward search from Lemma
4 succeeds in finding a relevant variable of the corresponding projection of $f$ on $y$,
this is, clearly, also a relevant variable of $f$.

The search will fail in many bins, since they contain either no or more than one
relevant variable. However, all relevant variables are detected in
this way: Among our partitions there is one, say $\pi$, that separates $R$. For every
relevant bin $y$ of $\pi$, there exists a bin assignment $a_0$ on the relevant bins such that
$f_\pi$ changes if the assignment of $y$ only changes. Due to r-universality, our family
contains a bin assignment $a$ inducing $a_0$ on the set of relevant bins. Finally, since
we fix all values of $a$ outside $y$, we find the unique relevant variable contained in
$y$, just by applying Lemma 4. This holds for every relevant bin, thus we find all
relevant variables. From the preceding discussion it follows also that our family
is r-universal. Since $R$ is learned, $f$ is learned, too.

The very simple auxiliary “computations” are only required for setting up 
the query bits and for searching the bins, so the amount of computation is $O(n)$
times the query number. Details are straightforward. □ 

So the computational complexity no longer contains the hardly acceptable
$n^{r+1}$ term. On the other hand, the query number is now rather large: The
extra $r^3 \log r \log n$ factor is quite significant. Therefore it is nice that we get rid
of some annoying factors by allowing two stages. This is presented in the next
section.



4 Two-Stage Attribute-Efficient Learning 

In our next result, notice especially the way in which the nonadaptive learning 
families are used. 

Theorem 4. Functions from $\mathrm{Rel}(n, r)$ can be learned in two stages using
$O(r^3 2^r \log n)$ queries and $O((2er)^r r^{7/2} \log r \log n + n \log n)$ computations.

Proof. Take a separating family from Lemma 3 and consider any of the partitions
$\pi$. Applying Theorem 2 to $b = r^2$ instead of $n$, we can learn $f_\pi$ nonadaptively
by $O(r^2 2^r)$ queries to $f$, followed by $O((2er)^r r^{5/2} \log r)$ computations.
This is simultaneously done in all $O(r \log n)$ partitions $\pi$.

Once again, at least one $\pi$ separates $R$. It is clear that we find one of them
by taking any $\pi$ such that $g := f_\pi$ has the maximum number of relevant bins.

Let us summarize: After the first stage we know a partition that separates $R$,
and for each of the (at most $r$) relevant bins $y$ we also know a bin assignment
such that $g$ changes if we switch the value of $y$ only. So we can apply Lemma 4
simultaneously in all relevant bins of $g$, in order to find the relevant variables of
$f$. This needs $O(r \log n)$ queries and $O(n \log n)$ computations. The query number
is dominated by the previous terms. Notice that we need only one further stage,
thus we have a two-stage algorithm. □

Note that the query number is reduced by a factor r log r log n which might 
be crucial in applications with large n and expensive queries. We can further 
improve it with the help of randomness, as shown in the next section. Concerning 
the $(2er)^r$ term in the computational complexity, remember that only small fixed
r are realistic anyway. 

5 Randomization 

Theorem 5. Functions from $\mathrm{Rel}(n, r)$ can be learned by a two-stage Monte
Carlo algorithm using $O(r^2 2^r + r \log n)$ queries and
$O((2er)^r r^{5/2} \log r \log n + n \log n)$ computations.

Proof. We may presume r > 1. 

Proceed as in Theorem 4 but replace the separating family by a random
partition into bins. By the proof of Lemma 3 it separates $R$ with constant
positive probability. Again we learn the coarsening $g$ applying Theorem 2. Then
we search for the relevant variables in the relevant bins by $r \log n$ nonadaptive
queries and $O(n \log n)$ computations.

This procedure can fail only if our random partition did not separate R. In 
this case the following happens: Either the search routine fails in some relevant 
bins containing more than one relevant variable (cf. the remark after Lemma
4), or we only detect a proper subset $S \subset R$. In the former case we recognize
immediately that our random partition was bad, in the latter case the failure 
may be undetected. Therefore this strategy is of Monte Carlo type. 

The bounds follow similarly as in the deterministic counterpart, but some 
factors are dropped. □ 

For safety reasons it may be desirable to have a Las Vegas algorithm where 
we are sure that all relevant variables are found when the algorithm has stopped. 
Finally we propose such an algorithm. Of course, the stage number is no longer 
guaranteed to be 2, and the query number grows again, but it is still better than 
in our deterministic algorithms. 

Theorem 6. Functions $f \in \mathrm{Rel}(n, r)$ can be learned by a Las Vegas algorithm
having the following expected complexity parameters: $O(1)$ stages,
$O(r^2 2^r + r 2^r \log n)$ queries, and $O((2er)^r r^{5/2} \log r \log n + r 2^r n \log n)$ computations.

Proof. The only difference to Theorem 5 is the verification of the output of the
two-stage strategy. Let $S \subseteq R$ be the set of relevant variables we found. With the
help of Lemma 2 check by $O(r 2^r \log n)$ simultaneous queries whether $S = R$.
This test is safe. Repeat the procedure until an affirmative result is obtained.

Since the two-stage learning algorithm in Theorem 5 succeeds with constant
probability, we have $O(1)$ expected stages and $O(r 2^r \log n)$ expected queries; it
remains to add the query bound of Theorem 5. We need $O(r 2^r n \log n)$ computations
to analyze the $S = R$ tests. □

6 Conclusions 

We pointed out that computing a given Boolean function with few relevant
variables from the outcome of a nonadaptive learning algorithm is a nontrivial
problem, and we proposed various parallel learning algorithms for this function
class, with a reasonable amount of subsequent computation. Apparently, the most
advisable solution at the moment is a two-stage method where an $O(r^2 2^r)$ nonadaptive
learning strategy is applied to size-$r^2$ coarsenings of the given function,
one of which separates the relevant variables. This guarantees a query number
not far from the optimum. We do not claim that the present complexity bounds
are already the best. Further research may discover more clever combinations
of the basic structures, or even an efficient solution to the original problem of
computing the smallest $(A, f)$-feasible set from $[f(a)]_{a \in A}$.

In the introduction we mentioned the equivalence of nonadaptive learning
and universal teaching sets. In contrast, a teaching set for a fixed function $f$
with respect to a function class is an assignment family that distinguishes $f$
from all other functions of that class. In $\mathrm{Rel}(n, r)$, the problem of teaching can
be easily settled: If $f$ has $s < r$ relevant variables then the teacher presents a
pairing of all $2^s$ possible assignments on $S$ and an $(r - s)$-universal family on
the remaining $n - s$ variables, similarly to Lemma 2. The latter is necessary in
order to convince the learner that no other relevant variables exist. So we have
a teaching set of size $O((r - s) 2^r \log n)$. Amazingly, if $f$ has the full number of
$r$ relevant variables then a teaching set of size $2^r$ is sufficient, independent of $n$.

Abstract. Learning from positive examples occurs very frequently in
natural learning. The PAC learning model of Valiant takes many features
of natural learning into account, but in most cases it fails to describe
this kind of learning. We show that in order to make learning from
positive data possible, extra information about the underlying distribution
must be provided to the learner. We define a PAC learning model
from positive and unlabeled examples. We also define a PAC learning
model from positive and unlabeled statistical queries. Relations with the
PAC model ([Val84]), the statistical query model ([Kea93]) and the constant-partition
classification noise model ([Dec97]) are studied. We show that
k-DNF and k-decision lists are learnable in both models, i.e. with far less
information than is assumed in previously used algorithms.



1 Introduction 

The PAC learning model of Valiant ([Val84]) has become the reference model in
computational learning theory. However, in spite of the importance of learning
from positive examples in natural learning, extending the PAC model in order to
model this kind of learning seems difficult. The reason is that there is
no good way to define the learning error. Suppose for example that $f$ is the
target concept, that $h$ is a hypothesis and let $\mu$ be the underlying distribution. If
the error is measured relative to the positive examples of $f$, i.e. if
$error(h) = \mu_f(f \Delta h)$, then over-generalization seems unavoidable: the “full” concept ($\Sigma^*$ for
languages, function 1 for boolean functions) is always a good answer. But if the
error is measured over all the examples, i.e. if $error(h) = \mu(f \Delta h)$, the learner
cannot differentiate between different distributions whose restrictions to the
positive examples of $f$ are equal. Consequently, the output concept must always
be included in the target concept and the learning boils down to learning with
one-sided error ([Nat87], [Shv90]). But since the underlying distribution can be
equal to 0 on some positive examples of the target, a learning algorithm will
not be able to use missing examples to infer negative ones. As a result, it is
often impossible to be sure that a hypothesis is included in the target concept.
To sum up, in most cases, positive examples do not provide enough information
to learn in the PAC framework ([Shv90]). The above discussion is detailed in
section 3.

* This research was partially supported by “Motricité et Cognition : Contrat par
objectifs région Nord/Pas-de-Calais”

However, there exist classes of concepts satisfying the following property:
there exists a (polynomial) collection of sets $E_i$ such that for every pair of concepts
$f$ and $g$ and for every distribution $\mu$, if for every index $i$, $\mu_f(E_i) \approx \mu_g(E_i)$, then
$\mu(f \Delta g) \approx 0$. This property does not mean that relative frequencies measured
on the positive examples suffice to determine the target, but that the target
is determined relative to the underlying distribution. In other words,
extra information about the underlying distribution suffices to make the learning
possible. These considerations lead us to define a PAC model of learning from
positive examples where information about the distribution is given by unlabeled
examples. Note that there are many situations in which it is natural to
suppose that the learner is given positive and unlabeled data: for example, in a
marketing analysis context, if we want to know which customers are liable to ask
for some specific service, we have at our disposal a population of customers who
have already asked for this service (positive data) and the global population
(unlabeled data). In a medical context, a physician knows the patients who have
developed a given disease (positive data) among his whole practice (unlabeled
data).

A similar approach was taken in [HTOTI where a model of concept learn- 
ing from unlabeled examples only is defined: the information about the target 
concept come through a dependence of the generating distribution upon this 
target. 

We also define a model of learning from positive statistical queries where
information about the distribution is given by unlabeled queries. Relations with the
PAC model ([Val84]), the statistical query model ([Kea93]) and the constant-partition
classification noise model ([Dec97]) are also studied. We then show
that the classes of k-DNF and k-decision lists are learnable from positive
statistical queries, i.e. with far less information than what is supposed in previously
known algorithms (IMHD, |K^ . IHEH^). 

A lot of work has been done on learning from positive examples only in
Gold’s model of learning in the limit ([Gol67], [Ang80], [Ber86], [Shi90], [ZL95]).
The problems encountered in Gold's framework, such as over-generalization, are clearly
related to the questions studied here. But a systematic comparison between the 
two frameworks is out of the scope of this paper. 

2 Preliminaries 

Let $B_n$ be the set of boolean functions from $X_n = \{0, 1\}^n$ into $\{0, 1\}$. Let
$X = \bigcup_{n \ge 1} X_n$ and $B = \bigcup_{n \ge 1} B_n$. A concept class $C$ over $X$ is a subset of $B$. We note
$C_n = C \cap B_n$.

A representation scheme for a concept class $C$ is a function $R : C \to 2^{\Sigma^*}$,
where $\Sigma$ is a finite alphabet, such that for each $f$ and $f'$ in $C$, $R(f)$ is not
empty and if $f \ne f'$, $R(f) \cap R(f') = \emptyset$. The size of a concept $f$ is
$size(f) = \min\{|c| : c \in R(f)\}$. We suppose that $R$ is computable in polynomial time, that
is, there exists a polynomial-time deterministic algorithm which takes as input a
pair of strings $x$ and $c$ and outputs 1 if $f(x) = 1$ with $c \in R(f)$, and 0 otherwise.

An example of a concept $f$ is a pair $(x, f(x))$, where $x$ is in the domain of $f$.
An example $(x, f(x))$ is positive if $f(x) = 1$ and negative otherwise. We denote
by $pos(f)$ (resp. $neg(f)$) the set of all $x$ such that $f(x) = 1$ (resp. $f(x) = 0$). If
$\mu$ is a probability distribution on $X_n$ and if $f$ is a boolean function defined on
$X_n$, $\mu(f)$ denotes $\mu(pos(f))$. If $\mu(f) \ne 0$, let $\mu_f$ be the restriction of $\mu$ to $pos(f)$
defined as follows: $\mu_f(x) = \mu(x)/\mu(f)$ if $x \in pos(f)$ and 0 otherwise.
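
The restricted distribution can be computed directly for a finite domain. The sketch below represents a distribution as a dictionary from points to exact fractions; this representation and the function name are assumptions made for illustration.

```python
from fractions import Fraction

def restrict_to_positives(mu, f):
    """Return mu_f: mu conditioned on pos(f), and 0 outside pos(f)."""
    mass = sum(p for x, p in mu.items() if f(x) == 1)   # this is mu(f)
    if mass == 0:
        raise ValueError("mu(f) = 0, so the restriction is undefined")
    return {x: (p / mass if f(x) == 1 else 0) for x, p in mu.items()}

# Hypothetical example: uniform distribution on {0,1}^2 and f = x_1.
quarter = Fraction(1, 4)
uniform = {(0, 0): quarter, (0, 1): quarter, (1, 0): quarter, (1, 1): quarter}
print(restrict_to_positives(uniform, lambda x: x[0]))
```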

A statistical query over $X_n$ is a mapping $\chi : X_n \times \{0, 1\} \to \{0, 1\}$. If $f \in B_n$,
the query $\chi_f$ denotes the mapping defined by $\chi_f(x, y) = 1$ iff $y = f(x)$.

Definition 1. Let $C$ be a concept class over $X$. Let $f \in C_n$ and $\mu$ be a distribution
over $X_n$.

- The oracle $EX(f, \mu)$ is a procedure that returns at each call an example
$(x, f(x))$ drawn randomly according to $\mu$.

- The oracle $UNL(\mu)$ is a procedure that returns at each call an unlabeled
example $x$ drawn randomly according to $\mu$.

- The oracle $STAT(f, \mu)$ is a procedure that, for every statistical query $\chi$ and
every $\tau \in (0, 1]$, with input $(\chi, \tau)$ returns an approximation of
$\mu(\{x \mid \chi(x, f(x)) = 1\})$ with an accuracy at least $\tau$.

- The noisy oracle $EX^{\eta_+, \eta_-}(f, \mu)$ is a procedure which at each call draws an
element $x$ of $X_n$ according to $\mu$ and returns (i) $(x, 1)$ with probability $1 - \eta_+$
and $(x, 0)$ with probability $\eta_+$ if $x \in pos(f)$, (ii) $(x, 0)$ with probability $1 - \eta_-$
and $(x, 1)$ with probability $\eta_-$ if $x \in neg(f)$.

All these oracles run in unit time.
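
For experimentation, the first three oracles can be simulated over a finite domain. This is a hedged sketch: the factory function `make_oracles` is hypothetical, and the simulated STAT oracle simply returns the exact probability, which is one admissible answer for every tolerance.

```python
import random

def make_oracles(f, mu, rng):
    """Build simulated EX(f, mu), UNL(mu) and STAT(f, mu) for a finite distribution mu."""
    points = list(mu)
    weights = [mu[x] for x in points]

    def draw():
        return rng.choices(points, weights=weights, k=1)[0]

    def ex():
        x = draw()
        return x, f(x)          # labeled example

    def unl():
        return draw()           # unlabeled example

    def stat(chi, tau):
        # Exact value of mu({x | chi(x, f(x)) = 1}); any answer within tau would do.
        return sum(p for x, p in mu.items() if chi(x, f(x)) == 1)

    return ex, unl, stat
```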

A k-monomial on the variables $x_1, \ldots, x_n$ is a conjunction of exactly $k$ literals.
When there is no ambiguity on the set of variables, we denote by k-MON the set
of all k-monomials, and for every boolean function $f$ we denote by $M_k(f)$ the set of
all k-monomials $m$ such that $m(x) = 1 \Rightarrow f(x) = 1$. The number of k-monomials
over $n$ variables is at most $(2n)^k$. A k-DNF is a disjunction of k-monomials. A
k-decision list (k-DL) is an ordered sequence $f = (m_1, b_1), \ldots, (m_l, b_l)$ in which
each $m_i$ is a k-monomial, each $b_i \in \{0, 1\}$ and $m_l = 1$. If $u \in X_n$, the value
$f(u)$ is defined to be $b_j$, where $j$ is the smallest index satisfying $m_j(u) = 1$. We
choose representation schemes such that the size of a k-DNF or a k-DL over $n$
variables is bounded by a polynomial in $n$. We denote by 1 the boolean function such
that $1(u) = 1$ for every $u$.
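
The evaluation rules for k-DNF and k-decision lists can be sketched in a few lines. The representation is an assumption made for illustration: a monomial is a tuple of (variable index, required value) literals, and the empty tuple stands for the constant monomial 1 that closes a decision list.

```python
def eval_monomial(m, u):
    """A monomial is satisfied when every literal (index, value) agrees with u."""
    return int(all(u[i] == v for i, v in m))

def eval_dnf(monomials, u):
    """A DNF is the disjunction of its monomials."""
    return int(any(eval_monomial(m, u) for m in monomials))

def eval_decision_list(dl, u):
    """Return b_j for the first pair (m_j, b_j) whose monomial is satisfied by u."""
    for m, b in dl:
        if eval_monomial(m, u):
            return b
    raise ValueError("the last monomial of a decision list must be the constant 1")

# Hypothetical 2-decision list over three variables.
dl = [(((0, 1), (1, 0)), 1), (((2, 1),), 0), ((), 1)]
print(eval_decision_list(dl, (0, 0, 1)))
```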

We take the two basic following definitions in ITOl . 

Definition 2. Let $C$ be a concept class over $X$. We say that $C$ is PAC learnable
if there exist a learning algorithm $L$ and a polynomial $p(., ., ., .)$ with the following
property: for any $f \in C$, for any distribution $\mu$ on $X$, and for any $0 < \epsilon < 1$
and $0 < \delta < 1$, if $L$ is given access to $EX(f, \mu)$ and to inputs $\epsilon$ and $\delta$, then
with probability at least $1 - \delta$, $L$ outputs a hypothesis concept $h \in C$ satisfying
$\mu(f \Delta h) < \epsilon$ in time bounded by $p(1/\epsilon, 1/\delta, size(f), n)$.

Definition 3. Let $C$ be a concept class over $X$. We say that $C$ is learnable
from statistical queries if there exist a learning algorithm $L$ and polynomials
$p(., ., .)$, $q(., ., .)$ and $r(., ., .)$ with the following property: for any $f \in C$, for any
distribution $\mu$ over $X$, and for any $0 < \epsilon < 1$, if $L$ is given access to $STAT(f, \mu)$
and to input $\epsilon$, then

- For every query $(\chi, \tau)$ made by $L$, the predicate $\chi$ can be evaluated in time
$q(1/\epsilon, n, size(f))$, and $1/\tau$ is bounded by $r(1/\epsilon, n, size(f))$.

- $L$ halts in time bounded by $p(1/\epsilon, n, size(f))$.

- $L$ outputs a hypothesis $h \in C$ that satisfies $\mu(f \Delta h) < \epsilon$.

The standard classification noise model is defined in . It is generalized
by the constant-partition classification noise (CPCN) model defined in [Dec97].
We give below a restricted variant of the CPCN model.

Definition 4. Let $C$ be a concept class over $X$. We say that $C$ is CPCN learnable
if there exist a learning algorithm $L$ and a polynomial $p(., ., ., ., .)$ with the
following property: for any $f \in C$, for any distribution $\mu$ on $X$, and for any
$0 \le \eta_+, \eta_- < 1/2$ and $0 < \epsilon, \delta < 1$, if $L$ is given access to $EX^{\eta_+, \eta_-}(f, \mu)$ and to
inputs $\epsilon$ and $\delta$, then with probability at least $1 - \delta$, $L$ outputs a hypothesis concept
$h \in C$ satisfying $\mu(f \Delta h) < \epsilon$ in time bounded by $p(1/\epsilon, 1/\delta, 1/\gamma, size(f), n)$,
where $\gamma = \min\{1/2 - \eta_+, 1/2 - \eta_-\}$.

3 Is it possible to learn with positive examples only? 

Let $f$ be a target over $X_n$, let $\mu$ be the underlying distribution (such that
$\mu(f) \ne 0$) and suppose that the only oracle available to the learner is $EX(f, \mu_f)$. Before
saying whether the learner is able to learn, we have to define how the error will be
evaluated.

The first idea could be to measure the error of a hypothesis h on the positive 
examples only. But if we do so, over-generalization will be unavoidable: 1 is a 
correct answer whatever the target is. 

Then, it seems necessary to take negative examples into account. But if the
error is measured in the standard way, taking $error(h) = \mu(f \Delta h)$, another
problem appears: two distributions $\mu$ and $\mu'$ can have the same restriction to
the positive examples of the target while they are very different on the negative
examples. More precisely, let $x_0 \in X_n \setminus pos(f)$ and let $\alpha \in [0, 1)$ be such that
$|\alpha - \mu(x_0)| > 1/2$. Define



We have $\mu_f = \mu'_f$ and $|\mu(x_0) - \mu'(x_0)| > 1/2$. Therefore, as it is impossible to
differentiate $\mu$ and $\mu'$ with the help of the oracle $EX(f, \mu_f)$, $x_0$ must not belong
to the output hypothesis. That is, learning from positive examples requires the
output hypothesis to be included in the target concept. But only for very
constrained classes, such as k-CNF or lattices, will it be possible to ensure that the
output concept is included in the target concept. See ([Nat87], [Shv90]) for
characterizations of such classes.

A related problem comes from the following fact: as it is impossible to differentiate
a negative example from a positive one on which the distribution is equal
to 0, the learner cannot use missing examples to infer negative information. 

Example 1. Consider the class of 1-DNF on two variables $x_1$ and $x_2$. Let $f = x_1$,
$g = x_2$, and let $\mu$ and $\mu'$ be such that $\mu(11) = \mu(01) = 1/2$ and $\mu'(11) = \mu'(10) = 1/2$.

Whatever the pair (target, distribution) is among $(f, \mu)$ and $(g, \mu')$, the sample
will be $S = \{11\}$. Is 01 a negative example or a positive example on which the
distribution is null? What must be learned?
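
The ambiguity in Example 1 can be verified mechanically. The sketch assumes the reading $\mu(11) = \mu(01) = 1/2$ and $\mu'(11) = \mu'(10) = 1/2$, which is consistent with the sample $S = \{11\}$; the helper name is hypothetical.

```python
from fractions import Fraction

def positive_restriction(mu, f):
    """Distribution seen by a learner that only receives positive examples of f."""
    mass = sum(p for x, p in mu.items() if f(x))
    return {x: p / mass for x, p in mu.items() if f(x) and p > 0}

half = Fraction(1, 2)
f = lambda x: x[0]                       # f = x_1
g = lambda x: x[1]                       # g = x_2
mu = {(1, 1): half, (0, 1): half}        # assumed reading of mu
mu_prime = {(1, 1): half, (1, 0): half}  # assumed reading of mu'

mu_f = positive_restriction(mu, f)
mu_prime_g = positive_restriction(mu_prime, g)
print(mu_f == mu_prime_g)
```

Both pairs put all positive mass on 11, so no amount of positive data distinguishes them, although 01 is negative for $f$ and positive for $g$.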

Now, in order to make the learning possible, we could demand that each used 
distribution points out only one target concept. That is, we could demand the 
target to be the minimal concept consistent with a sufficiently large sample. 

For example, if the target is $x_1 + x_2$, we should have $\mu(01)$, $\mu(10)$ and $\mu(11)$
not too small. But, in addition to the fact that this restriction seems artificial, the 
simplest classes of concepts remain not learnable. We have shown (see iniSM!) 
that the problem of finding a minimal 1-DNF consistent with a positive sample 
is not polynomial (under the assumption $P \neq LOGSNP$).

So, is nothing possible? To our knowledge, the analysis of PAC learning
from positive examples only usually stopped here. And yet, it is possible to go
further. The following result shows that, with regard to $k$-DNF, the possible
outputs are somehow determined by positive data.

Proposition 1. For every $\epsilon \in [0, 1]$, for every integer $n$, for all $k$-DNF $f$
and $g$ over $X_n$ and for every distribution $\mu$ over $X_n$ such that $\mu(f) \neq 0$ and
$\mu(g) \neq 0$, if for every $k$-monomial $m$,

$|\mu_f(m) - \mu_g(m)| < \alpha = \frac{\epsilon}{N(N+1)}$

where $N = (2n)^k$, then

$\mu(f \Delta g) \leq \epsilon$

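Proof. Let $m \in M_k(f)$ be such that $\mu(m) \geq \mu(f)/N$. Such a monomial exists since
the number of $k$-monomials is bounded by $N$.

We have

$\mu_f(m) = \mu(m)/\mu(f) \leq \mu_g(m) + \alpha = \mu(m \cap g)/\mu(g) + \alpha \leq \mu(m)/\mu(g) + \alpha$

that is,

$\mu(g) \leq \mu(f) + \alpha\,\mu(f)\mu(g)/\mu(m) \leq \mu(f) + \alpha N$

Symmetrically, we can get

$\mu(f) \leq \mu(g) + \alpha N$

and therefore

$|\mu(g) - \mu(f)| \leq \alpha N$

Now we have

$\mu(f \setminus g) \leq \sum_{m \in M_k(f) \setminus M_k(g)} \mu(m \setminus g) = \sum_{m \in M_k(f) \setminus M_k(g)} [\mu(m \cap f) - \mu(m \cap g)]$

$= \sum_{m \in M_k(f) \setminus M_k(g)} [\mu(f)\mu_f(m) - \mu(g)\mu_g(m)]$

$\leq \sum_{m \in M_k(f) \setminus M_k(g)} [\mu(f)\,|\mu_f(m) - \mu_g(m)| + \mu_g(m)\,|\mu(f) - \mu(g)|]$

$\leq \sum_{m \in M_k(f) \setminus M_k(g)} [|\mu_f(m) - \mu_g(m)| + |\mu(g) - \mu(f)|]$

Getting a similar bound for $\mu(g \setminus f)$ we get

$\mu(f \Delta g) \leq \sum_{m \in M_k(f) \Delta M_k(g)} [|\mu_f(m) - \mu_g(m)| + |\mu(g) - \mu(f)|] \leq N\alpha(N+1) = \epsilon$

□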



This result may seem quite paradoxical. Example 1 shows that it is impossible
to differentiate $f = x_1$ from $g = x_2$ if the only available data is 11, and the
previous proposition says that the target is determined by the frequencies on
positive data. In fact, what is determined is not the target but the target
when the underlying distribution is known. On the previous example,
Proposition 1 says that if the distribution is $\mu$, then the correct hypothesis must
be $f$ while if the distribution is $\mu'$, it must be $g$.

We think that this is the best we can expect from positive examples in the
PAC framework: a learning algorithm has to return an approximation of the
target concept as soon as extra information about the underlying distribution
is given.



4 Learning from positive examples 

In the following definitions, the “positive” information about the target will
be given by the oracles $EX(f, \mu_f)$ or $STAT(f, \mu_f)$ while the extra information
about the distribution will be given by $UNL(\mu)$ or $STAT(1, \mu)$.

Definition 5. Let C be a concept class over X. We say that C is PAC learnable
from positive examples if there exist a learning algorithm L and a polynomial
$p(., ., ., .)$ with the following property: for any integer $n$, for any $f \in C_n$, for any
distribution $\mu$ on $X_n$, and for any $0 < \epsilon < 1$ and $0 < \delta < 1$, if L is given access
to $EX(f, \mu_f)$, $UNL(\mu)$ and to inputs $\epsilon$ and $\delta$, then with probability at least $1 - \delta$,
L outputs a hypothesis concept $h \in C_n$ satisfying $\mu(f \Delta h) < \epsilon$ in time bounded
by $p(1/\epsilon, 1/\delta, size(f), n)$.

Remark that if a concept class C is PAC learnable and if there exists a learning
algorithm for C which does not use negative examples of the target, then C is
PAC learnable from positive examples. Therefore, $k$-CNF ([Val84]) and integer
lattices ([HSW92]) are learnable from positive examples.

A similar approach has been taken in mm- A model of unsupervised 
learning is defined in which the task of the learner is to identify a probability 
distribution or more precisely, its high probability-density areas, from unlabeled 
examples. Then, a Learning Without A Teacher model is proposed, in which it
is assumed that “for points outside the target the distribution density is lower
than a certain threshold $\alpha$, while inside the target the density exceeds some
value $\beta > \alpha$”. A characterization of learnability is given, from an
information-theoretic point of view; but the computational complexity of learning inside
specific hypothesis spaces is not studied.

Definition 6. Let C be a concept class over X. We say that C is learnable
from positive statistical queries if there exist a learning algorithm L and
polynomials $p(., ., .)$, $q(., ., .)$ and $r(., ., .)$ with the following property: for any
integer $n$, for any $f \in C_n$, for any distribution $\mu$ over $X_n$, and for any $0 < \epsilon < 1$,
if L is given access to $STAT(f, \mu_f)$ and $STAT(1, \mu)$ and to input $\epsilon$, then

- For every query $(\chi, \tau)$ made by L, the predicate $\chi$ can be evaluated in time
$q(1/\epsilon, n, size(f))$, and $1/\tau$ is bounded by $r(1/\epsilon, n, size(f))$.

- L will halt in time bounded by $p(1/\epsilon, n, size(f))$.

- L will output a hypothesis $h \in C_n$ that satisfies $\mu(f \Delta h) < \epsilon$.



Proposition 2. Let us denote by POSQ (resp. Q, CPCN, POSEX, PAC) the
set of classes learnable with positive statistical queries (resp. statistical queries,
constant partition classification noise, positive examples, positive and negative
examples). The following relations hold:

POSQ $\subseteq$ Q $\subseteq$ CPCN $\subseteq$ POSEX $\subseteq$ PAC



Proof (sketch).

POSQ $\subseteq$ Q: the oracles $STAT(1, \mu)$ and $STAT(f, \mu_f)$ can easily be simulated
using the oracle $STAT(f, \mu)$ (see the complete proof in [Den98]).

Q $\subseteq$ CPCN: This result is proved in [Dec97].

CPCN $\subseteq$ POSEX (with the help of an anonymous referee): Let C be a
concept class in CPCN, $f$ be a concept of $C_n$, $\mu$ be a distribution over $X_n$ such
that $\mu(f) \neq 0$, $0 < \epsilon < 1$ and $0 < \delta < 1$. Let $\nu$ be the distribution defined
by:

$\nu(x) = 2\mu_f(x)/3 + \mu(x)/3$ if $x \in pos(f)$, and $\nu(x) = \mu(x)/3$ otherwise.

We can easily verify that the noisy oracle $EX^{\eta_+, \eta_-}(f, \nu)$ with $\eta_- = 0$ and
$\eta_+ = \mu(f)/(2 + \mu(f))$ can be simulated this way: with probability 2/3, get an example
from $EX(f, \mu_f)$ and label it +, and with probability 1/3, get an example
from $UNL(\mu)$ and label it -. A negative example of $f$ is always labelled -; a
positive example of $f$ is labelled - with probability $\eta_+$.

Note that $1/2 - \eta_+ \geq 1/6$ and that for every subset $A$ of $X_n$, $\nu(A) \geq \mu(A)/3$.
Therefore, in order to learn C from positive examples with accuracy parameter
$\epsilon$, run the CPCN algorithm with accuracy parameter $\epsilon/3$ and at each
call of $EX^{\eta_+, \eta_-}(f, \nu)$, call $EX(f, \mu_f)$ with probability 2/3 and $UNL(\mu)$
with probability 1/3 and return the result according to the labelling defined
above.

POSEX $\subseteq$ PAC: the oracles $UNL(\mu)$ and $EX(f, \mu_f)$ can easily be simulated
using the oracle $EX(f, \mu)$.

Remark that the class of parity functions can be learned in the PAC model using
positive examples only ([HSW92], [Kea93]). It is proved in [Kea93] that it
is not learnable with statistical queries. Therefore, the class of parity functions
is in POSEX but not in Q.

We cannot prove that POSEX (resp. POSQ) is strictly included in PAC
(resp. Q). We conjecture that the class composed of complementary sets of
lattices is not learnable from positive examples (while it is PAC learnable). □
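
The CPCN $\subseteq$ POSEX step of the proof sketch can be checked on a small example. The sketch below is an illustration under stated assumptions, not the construction used in the cited proofs: it computes exactly (with rationals, no sampling) the joint law of (example, label) produced by mixing the positive-example oracle with weight 2/3 and the unlabeled oracle with weight 1/3. The function names and the four-point distribution are hypothetical.

```python
from fractions import Fraction


def simulated_noisy_law(mu, f):
    """Joint law of (x, label): w.p. 2/3 draw x from mu_f and label '+',
    w.p. 1/3 draw x from mu and label '-'."""
    weight_f = sum(p for x, p in mu.items() if f(x))
    law = {}
    for x, p in mu.items():
        if f(x):
            law[(x, "+")] = Fraction(2, 3) * p / weight_f
        law[(x, "-")] = Fraction(1, 3) * p
    return law


def marginal(law, x):
    """nu(x): probability that the simulated oracle returns instance x."""
    return law.get((x, "+"), 0) + law[(x, "-")]


def minus_rate(law, x):
    """Probability that instance x carries the label '-'."""
    return law[(x, "-")] / marginal(law, x)


def f(x):
    """Hypothetical target f = x1 on bit strings 'x1x2'."""
    return x[0] == "1"


mu = {"00": Fraction(1, 8), "01": Fraction(1, 8),
      "10": Fraction(1, 4), "11": Fraction(1, 2)}
law = simulated_noisy_law(mu, f)
weight_f = sum(p for x, p in mu.items() if f(x))
eta_plus = weight_f / (2 + weight_f)
```

On this example every positive instance is mislabelled at the same rate `eta_plus`, every negative instance is always labelled -, and each instance keeps at least a third of its original weight.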

As a corollary, the previous proposition proves that $k$-DNF and $k$-DL are
learnable from positive examples since they are learnable from statistical queries.

Moreover, if the learner knows the underlying distribution and can simulate
it within polynomial time, he can learn any class in Q from positive examples
only. For example:

Corollary 1. The classes of $k$-DNF and $k$-DL are learnable from $EX(f, u_f)$
only under the uniform distribution $u$.

Proof. The oracle $EX(1, u)$ can be simulated by tossing a coin. □

A concept class learnable from statistical queries may fail to be learnable from
positive statistical queries with the same space of queries. For example, let
$C = \{f, g\} \subseteq 2^{\{a, b\}}$ where $f = \{a, b\}$ and $g = \{a\}$, and let $\chi(x, y) = 1$ if $y = 1$ and
$\chi(x, y) = 0$ otherwise. We have $STAT(f, \mu)(\chi, \tau) \approx 1$ and $STAT(g, \mu)(\chi, \tau) \approx \mu(a)$
while $STAT(f, \mu_f)(\chi, \tau) \approx 1$ and $STAT(g, \mu_g)(\chi, \tau) \approx 1$. Therefore, C is
learnable using the statistical query $\chi$ but it is not learnable using the positive
(restriction of the) statistical query $\chi$.

We prove in the next section that $k$-DNF and $k$-DL remain learnable from
positive statistical queries.

5 Learning from positive statistical queries 

Definition 7. Let C be a concept class over X. We say that the weight of
concepts of C can be estimated from positive statistical queries if there exist an
algorithm W and a polynomial $p(., ., .)$ with the following property: for any
integer $n$, for any $f \in C_n$, for any distribution $\mu$ over $X_n$, and for any $0 < \epsilon < 1$,
if W is given access to the statistical query oracles $STAT(f, \mu_f)$ and $STAT(1, \mu)$
and to input $\epsilon$, then W outputs a number $\hat{\mu}(f)$ such that $|\mu(f) - \hat{\mu}(f)| \leq \epsilon$ and
W halts in time bounded by $p(1/\epsilon, n, size(f))$.



Theorem 1. Let C be a concept class over X learnable from statistical queries. 
If the weight of concepts of C can be estimated from positive statistical queries 
then C is learnable from positive statistical queries. 

Proof. Let L be the learning algorithm from statistical queries and let W be the 
algorithm which evaluates the weight of concepts of C. The following algorithm 
learns C from positive statistical queries. 



Learning algorithm L'

Input: $\epsilon$, $n$
Begin
    Run algorithm L
    Each time algorithm L asks the oracle $STAT(f, \mu)$
    in order to evaluate the query $(\chi, \tau)$
        Run $W(\tau/4)$ and let $\hat{\mu}(f)$ be the result
        Let $\chi^0$ be the query defined by $\chi^0(x, y) = \chi(x, 0)$
        Let $\chi^1$ be the query defined by $\chi^1(x, y) = \chi(x, 1)$
        Let $\hat{\mu}^0 = STAT(1, \mu, \chi^0, \tau/4)$
        Let $\hat{\mu}_f^0 = STAT(f, \mu_f, \chi^0, \tau/4)$
        Let $\hat{\mu}_f^1 = STAT(f, \mu_f, \chi^1, \tau/4)$
        Return $\hat{\mu}^0 + (\hat{\mu}_f^1 - \hat{\mu}_f^0)\,\hat{\mu}(f)$ to algorithm L
End
Output: the output of algorithm L



It is easy to verify that

$\mu(\{x \mid \chi(x, f(x)) = 1\})$

$= \mu(\{x \mid \chi(x, 1) = 1 \wedge f(x) = 1\}) + \mu(\{x \mid \chi(x, 0) = 1 \wedge f(x) = 0\})$

$= \mu_f(\{x \mid \chi(x, 1) = 1\})\,\mu(f) + (\mu(\{x \mid \chi(x, 0) = 1\}) - \mu(\{x \mid \chi(x, 0) = 1 \wedge f(x) = 1\}))$

$= \mu(\{x \mid \chi(x, 0) = 1\}) + (\mu_f(\{x \mid \chi(x, 1) = 1\}) - \mu_f(\{x \mid \chi(x, 0) = 1\}))\,\mu(f)$

The theorem follows. □
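
The identity used in this proof can be checked numerically. The sketch below is only an illustration: it assumes exact oracles (tolerance 0) over a small finite domain, and the weight of the target is computed directly instead of being returned by an estimator W. The names `prob`, `simulate_stat` and `direct_stat` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob(dist, event):
    """Probability of the event {x : event(x)} under dist."""
    return sum(p for x, p in dist.items() if event(x))


def simulate_stat(mu, f, chi):
    """Value of the query chi on (x, f(x)) rebuilt from positive and unlabeled queries."""
    weight = prob(mu, f)                                              # plays the role of W
    mu_chi0 = prob(mu, lambda x: chi(x, 0) == 1)                      # STAT(1, mu, chi^0)
    mu_f_chi0 = prob(mu, lambda x: f(x) and chi(x, 0) == 1) / weight  # STAT(f, mu_f, chi^0)
    mu_f_chi1 = prob(mu, lambda x: f(x) and chi(x, 1) == 1) / weight  # STAT(f, mu_f, chi^1)
    return mu_chi0 + (mu_f_chi1 - mu_f_chi0) * weight


def direct_stat(mu, f, chi):
    """Value the ordinary oracle STAT(f, mu) would return for chi, with no error."""
    return prob(mu, lambda x: chi(x, int(f(x))) == 1)


points = list(product((0, 1), repeat=3))
mu = {x: Fraction(i + 1, 36) for i, x in enumerate(points)}  # weights 1/36 .. 8/36


def f(x):
    """Hypothetical target: x1 or not x3."""
    return x[0] == 1 or x[2] == 0


def chi(x, y):
    """Hypothetical query: does the label agree with x2?"""
    return int(x[1] == y)


simulated = simulate_stat(mu, f, chi)
exact = direct_stat(mu, f, chi)
```

With noisy oracles, each of the four estimates carries an error of at most $\tau/4$, which is why the algorithm asks for that tolerance.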

We now apply this result to $k$-DNF and $k$-DL.



Proposition 3. The class of $k$-DNF formulas is learnable from positive
statistical queries.
Proof. Let $f$ be a $k$-DNF over $n$ variables and $m$ be a $k$-monomial over $X_n$. Let
$\mu$ be a distribution over $X_n$ such that $\mu(f) \neq 0$. We have

$\mu_f(m) = \mu(f \cap m)/\mu(f) \leq \mu(m)/\mu(f)$

i.e. for every $m \in k$-MON such that $\mu_f(m) \neq 0$,

$\mu(f) = \mu(f \cap m)/\mu_f(m) \leq \mu(m)/\mu_f(m)$

and if $m$ is in $M_k(f)$, i.e. if $m \subseteq f$,

$\mu(f) = \mu(m)/\mu_f(m)$

Therefore, we get

$\mu(f) = \min\{\mu(m)/\mu_f(m) \mid m \in k\text{-MON},\ \mu_f(m) \neq 0\}$

and since there exists a monomial $m$ in $M_k(f)$ such that $\mu_f(m) \geq 1/N$ (where
$N = (2n)^k$), we have

$\mu(f) = \min\{\mu(m)/\mu_f(m) \mid m \in k\text{-MON},\ \mu_f(m) \geq 1/N\}$

The following algorithm computes an estimate of $\mu(f)$.



Learning the weight of a $k$-DNF

Input: $\epsilon$, $n$
Begin
    Let $\tau = \epsilon/(8N^2)$
    For all $k$-monomials $m$
        compute $\hat{\mu}_f(m) = STAT(f, \mu_f, \chi_m, \tau)$
            {$\chi_m(x, y) = 1$ iff $y = m(x)$}
        compute $\hat{\mu}(m) = STAT(1, \mu, \chi_m, \tau)$
    EndFor
    Let $\hat{\mu}(f) = \min\{\hat{\mu}(m)/\hat{\mu}_f(m) \mid m \in k\text{-MON},\ \hat{\mu}_f(m) \geq 1/N - \tau\}$
End
Output: $\hat{\mu}(f)$

We have $\mu(f) = \min\{\mu(m)/\mu_f(m) \mid m \in k\text{-MON},\ \hat{\mu}_f(m) \geq 1/N - \tau\}$.

Verify that if $\hat{\mu}_f(m) \geq 1/N - \tau$,

$|\mu(m)/\mu_f(m) - \hat{\mu}(m)/\hat{\mu}_f(m)| \leq 2\tau/(\mu_f(m)\,\hat{\mu}_f(m)) \leq 2\tau/[(1/N - \tau)(1/N - 2\tau)]$

and since $1/N - \tau \geq 1/N - 2\tau \geq 1/(2N)$ we have

$|\hat{\mu}(f) - \mu(f)| \leq 2\tau \cdot 4N^2 = \epsilon$

We can now apply Theorem 1.

□
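
The minimum-ratio characterization of $\mu(f)$ can be tried out on a toy instance. The following sketch assumes exact oracles (so $\tau = 0$) and a small hypothetical 2-DNF over three variables; the oracles are built from the target only to stand in for $STAT(f, \mu_f)$ and $STAT(1, \mu)$, while the estimator itself sees nothing but their answers. All function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product


def k_monomials(n, k):
    """Yield every monomial with 1..k literals as a tuple of (index, value) pairs."""
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for vals in product((0, 1), repeat=size):
                yield tuple(zip(idxs, vals))


def satisfies(x, mono):
    return all(x[i] == v for i, v in mono)


def make_oracles(mu, f):
    """Exact stand-ins for STAT(1, mu, chi_m) and STAT(f, mu_f, chi_m)."""
    weight_f = sum(p for x, p in mu.items() if f(x))

    def mu_of(m):
        return sum(p for x, p in mu.items() if satisfies(x, m))

    def mu_f_of(m):
        return sum(p for x, p in mu.items() if f(x) and satisfies(x, m)) / weight_f

    return mu_of, mu_f_of


def estimate_weight(mu_of, mu_f_of, n, k):
    """Minimum of mu(m)/mu_f(m) over monomials that are frequent among positives."""
    big_n = (2 * n) ** k
    ratios = [mu_of(m) / mu_f_of(m)
              for m in k_monomials(n, k)
              if mu_f_of(m) >= Fraction(1, big_n)]
    return min(ratios)


points = list(product((0, 1), repeat=3))
mu = {x: Fraction(i + 1, 36) for i, x in enumerate(points)}


def f(x):
    """Hypothetical 2-DNF target: (x1 and x2) or (not x3)."""
    return (x[0] == 1 and x[1] == 1) or x[2] == 0


mu_of, mu_f_of = make_oracles(mu, f)
estimate = estimate_weight(mu_of, mu_f_of, n=3, k=2)
```

With exact answers the minimum is attained by a monomial contained in the target, so the estimate coincides with the true weight of the target.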
We now prove an analogous result for $k$-decision lists. The proof is trickier
in this case.

Theorem 2. The class of $k$-decision lists is learnable from positive statistical
queries.

As in the previous proposition, we just have to prove that the weight of $k$-decision
lists can be estimated from positive statistical queries.

Let $f$ be a $k$-DL over $n$ variables and let $\mu$ be a distribution over $X_n$ such
that $\mu(f) \neq 0$.

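Let

$M_f^\mu = \{x \in X_n \mid \forall m \in k\text{-MON},\ m(x) = 1 \Rightarrow \mu_f(m) \neq 0\}$

Let $\overline{M_f^\mu}$ be the complementary set of $M_f^\mu$. We have

$\overline{M_f^\mu} = \bigcup \{m \in k\text{-MON} \mid \mu_f(m) = 0\}$

We show below some properties of $M_f^\mu$.

Lemma 1. 1. $\mu(f \setminus M_f^\mu) = 0$.

2. for every subset $A$ of $X_n$, $\mu_f(A) \leq \mu(A \cap M_f^\mu)/\mu(f)$.

3. if $(m, 1)$ is the first (positive) term of $f$ such that $\mu_f(m) \neq 0$, then
$\mu_f(m) = \mu(m \cap M_f^\mu)/\mu(f)$.

4. $\mu(f) = \min\{\mu(m \cap M_f^\mu)/\mu_f(m) \mid m \in k\text{-MON},\ \mu_f(m) \neq 0\}$.

Proof. 1. let $x \in f \setminus M_f^\mu$, and let $m \in k$-MON be such that $m(x) = 1$ and $\mu_f(m) = 0$.
As $x \in f$, we have $\mu(x) = 0$.

2. $\mu_f(A) = \mu(A \cap f)/\mu(f) \leq [\mu(A \cap (f \setminus M_f^\mu)) + \mu(A \cap M_f^\mu)]/\mu(f) \leq \mu(A \cap M_f^\mu)/\mu(f)$.

3. let $x \in m \cap M_f^\mu$ be such that $\mu(x) \neq 0$. For every term $(m', b)$ preceding $(m, 1)$
in $f$, $\mu_f(m') = 0$ and since $x \in M_f^\mu$, we have $m'(x) = 0$. Therefore $x \in f$
and $\mu(m \cap M_f^\mu) = \mu(m \cap f)$.

4. apply the two previous points.

□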

The last relation is much less robust than the analogous one for $k$-DNF. This
is because $\mu_f(m) = \mu(m \cap M_f^\mu)/\mu(f)$ can be true for only one monomial $m$, and
moreover, the weight of $m$ under $\mu$ can be very small. In the following learning
algorithm, we build a distribution $\nu$, close to $\mu$ and such that the first positive
term $(m, 1)$ of $f$ such that $\nu_f(m) \neq 0$ has not too small a weight under $\nu$.



Learning the weight of a $k$-decision list (WDL)

Input: $\epsilon$, $n$
Begin

Let N = (2n)^, a = 5^7 D = o;/4, T2 = (ea^)/64 

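    {We now build a set $M$ such that for every $k$-monomial $m$,
    $\mu_f(m \cap \overline{M})$ is null or not too small}
    {$\overline{M}$ is the complementary set of $M$}
    $M = \emptyset$
    $MON_\alpha = k$-MON
    Loop
        For all $k$-monomials $m \in MON_\alpha$
            ask $\hat{\mu}_f(m \cap \overline{M}) = STAT(f, \mu_f, \chi_{m \cap \overline{M}}, \tau_1)$
        EndFor
        If $\forall m \in MON_\alpha$, $\hat{\mu}_f(m \cap \overline{M}) \geq \alpha$ then
            ExitLoop
        EndIf
        $AUX \leftarrow \{m \in MON_\alpha \mid \hat{\mu}_f(m \cap \overline{M}) < \alpha\}$
        $M \leftarrow M \cup \bigcup \{m \in AUX\}$
        $MON_\alpha \leftarrow MON_\alpha \setminus AUX$
        {Note that $M$ is a $k$-DNF and that
        the queries $\chi_{m \cap \overline{M}}$ can be evaluated in polynomial time}
    EndLoop
    For all $k$-monomials $m$ in $MON_\alpha$ do
        ask $\hat{\mu}(m \cap \overline{M}) = STAT(1, \mu, \chi_{m \cap \overline{M}}, \tau_2)$
        ask $\hat{\mu}_f(m \cap \overline{M}) = STAT(f, \mu_f, \chi_{m \cap \overline{M}}, \tau_2)$
    EndFor
    compute $\hat{\mu}(f) = \min\{\hat{\mu}(m \cap \overline{M})/\hat{\mu}_f(m \cap \overline{M}) \mid m \in MON_\alpha\}$
End
Output: $\hat{\mu}(f)$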



Lemma 2. The previous algorithm runs in polynomial time and outputs $\hat{\mu}(f)$
such that $|\mu(f) - \hat{\mu}(f)| \leq \epsilon$.

The proof, a bit technical, relies on several lemmas.

Suppose in all the following that we have run the algorithm WDL.

Lemma 3. $\mu(M \cap f) \leq N(\alpha + \tau_1)\,\mu(f) < 1$.

Proof. Each time a monomial $m$ is added to $M$ in the previous algorithm, this
is because $\hat{\mu}_f(\overline{M} \cap m) < \alpha$, which implies $\mu_f(\overline{M} \cap m) < \alpha + \tau_1$. The quantity
added to $\mu(M \cap f)$ is $\mu(\overline{M} \cap m \cap f) \leq (\alpha + \tau_1)\,\mu(f)$. And because the number of
$k$-monomials is less than $N$, we get the result. □

Let $\nu$ be the distribution over $X_n$ defined by:

$\nu(x) = 0$ if $x \in M \cap f$ and $\nu(x) = \mu(x)/\mu(\overline{M} \cup \overline{f})$ otherwise.

We prove some facts about $\nu$ which show that $\nu(f)$ is close to $\mu(f)$:

Lemma 4. 1. We have $M = \overline{M_f^\nu}$.

2. for every subset $A$ of $X_n$, we have $|\nu(A) - \mu(A)| \leq 2N(\alpha + \tau_1)\,\mu(f)$.

3. we have $1 - 2N(\alpha + \tau_1) \leq \nu(f)/\mu(f) \leq 1 + 2N(\alpha + \tau_1)$.

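Proof. 1. - Let $x \in M$. There exists $m \in k$-MON such that $m(x) = 1$ and
$m \subseteq M$. Then, $\nu(m \cap f) = \nu_f(m) = 0$ and $x \in \overline{M_f^\nu}$.

- Let $x \in \overline{M_f^\nu}$. There exists $m \in k$-MON such that $m(x) = 1$ and
$\nu_f(m) = \nu(m \cap f) = 0$. Then $\mu_f(m \cap \overline{M}) = \mu(m \cap \overline{M} \cap f)/\mu(f) = \mu(\overline{M} \cup \overline{f})\,\nu(m \cap \overline{M} \cap f)/\mu(f) = 0$.
Therefore, $m$ cannot be in $MON_\alpha$ since $\tau_1 < \alpha$. We
have $x \in m \subseteq M$.

2. we have

$|\nu(A) - \mu(A)| \leq \sum_{x \in X_n} |\nu(x) - \mu(x)| \leq \sum_{x \in M \cap f} |\nu(x) - \mu(x)| + \sum_{x \in \overline{M} \cup \overline{f}} |\nu(x) - \mu(x)|$

$\leq \mu(M \cap f) + \sum_{x \in \overline{M} \cup \overline{f}} \mu(x)\,(1/\mu(\overline{M} \cup \overline{f}) - 1)$

$\leq \mu(M \cap f) + 1 - \mu(\overline{M} \cup \overline{f}) \leq 2\mu(M \cap f) \leq 2N(\alpha + \tau_1)\,\mu(f)$

3. applying the last point, we get:

$-2N(\alpha + \tau_1)\,\mu(f) \leq -\mu(f) + \nu(f) \leq 2N(\alpha + \tau_1)\,\mu(f)$

that is

$1 - 2N(\alpha + \tau_1) \leq \nu(f)/\mu(f) \leq 1 + 2N(\alpha + \tau_1)$

□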

We can now prove Lemma 2.

Proof. First note that the algorithm runs in polynomial time.

The only thing to prove is that $|\mu(f) - \hat{\mu}(f)| \leq \epsilon$.

- From Lemma 4 we have $|\nu(f)/\mu(f) - 1| \leq 2N(\alpha + \tau_1)$.

— Let m G MONa. We have 

/r(m n M) fi{m n M) ^ n M) — Hf{m n M)fi{m n M)\ 

fif{mr)M) fif{mr]M) ~ ^f{mDM)fif{mr)M) 



/j,y(m n n M) 

^ 2t2 ^ ^ 

“ {a — Ti){a — Ti — T2) ~ 

since ri < a/4 and T 2 < a/4 
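- We also have

$\frac{\nu(m \cap \overline{M})}{\nu_f(m)} = \frac{\mu(m \cap \overline{M})}{\mu(\overline{M} \cup \overline{f})} \cdot \frac{\mu(\overline{M} \cup \overline{f})\,\nu(f)}{\mu(m \cap f \cap \overline{M})} = \frac{\mu(m \cap \overline{M})\,\nu(f)}{\mu(m \cap f \cap \overline{M})} = \frac{\mu(m \cap \overline{M})}{\mu_f(m \cap \overline{M})} \cdot \frac{\nu(f)}{\mu(f)}$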
- Using this relation, we get

n M) pi{mC]M) 

Vf{m) 

^ v{f) njmfMvI) _ /t(wnM) /r(mnM) v{f) _ ^ 
~M(/)M/(wnM) jj,f{mr\My yf{mr\My ^i{f) 

^ 21 ^(/) II 

< + 4iV(a + Ti) 

for every m G MONa 

— Now, let mg G MONa such that i/{f) = and mi G MONa such 

that A(/) = -0^^- 



Wif)- Kf)\ = 



i^(mo n M) /i(mi n M) 



Vf{mo) /{/(rniPM)' 



< 2Max{\ 



iy{mr]M) fi{mr]M) 



I'fim) jif(rnC]M) 



\m G MONa} 



< + 8N{a + Ti) 



- To end the proof,

$|\mu(f) - \hat{\mu}(f)| \leq |\mu(f) - \nu(f)| + |\nu(f) - \hat{\mu}(f)|$

and since $|\mu(f) - \nu(f)| \leq 2N(\alpha + \tau_1)$ from Lemma 4,



Im(/) - A(/)l < + 10iV(a + Ti) < 



□ 



As in Corollary 1, if the learner knows the underlying distribution and can
compute it within polynomial time, he can learn $k$-DNF and $k$-DL from positive
queries only.



6 Conclusion 

The models defined in this paper show that it is possible to describe learning from
positive data in the PAC learning framework, as soon as information is given on
the underlying distribution. Moreover, learning from positive and unlabeled data
seems natural in many contexts. Lastly, these results show that many classes
learnable in the PAC model are in fact learnable under much more severe
constraints: positive and unlabeled queries provide far less information than
positive and negative examples. In other words, classes which are learnable in the
PAC framework are so not only because they meet the PAC model requirements
but also because they meet other, more restrictive ones.
Abstract. Reviewing structured weight-based prediction algorithms (SWP
for short) due to Takimoto, Maruoka and Vovk, we present underlying
design methods for constructing a variety of on-line prediction algorithms
based on the SWP. In particular, we show how the typical expert model,
where the experts are considered to be arranged on one layer, can
be generalized to the case where they are laid on a tree structure, so
that the expert model can be applied to search for the best pruning in a
straightforward fashion through a dynamic programming scheme.



1 Introduction 

Based on the mistake bound model, multiplicative weight-update prediction
algorithms have been studied which predict the classification of an instance from
the environment at each time step (see [1], [2], [5], [7] and [8]). As opposed to the
typical PAC learning model, the mistake bound model makes no assumptions
about the way the sequence of instances, and hence the sequence of outcomes
specifying the classification, is generated. Instead, in the mistake bound model
we usually assume a pool of experts $\mathcal{E}$ which are supposed to output binary values
so that, using the binary values the experts give, the prediction algorithm
makes its own prediction.

Although these models seem to have been investigated mainly separately
so far, some topics having to do with both models have recently begun to be
explored. In fact Freund and Schapire used the on-line prediction model
to derive a new boosting algorithm [2]; Kearns and Mansour [4] constructed
an efficient pruning algorithm that, in the PAC setting, enjoys a strong
performance guarantee of the style of the prediction model. In this paper we give
various multiplicative weight-update algorithms from [7] and present the computational
scheme behind these algorithms in as simple a form as possible. It is our hope
that exploring the computational mechanism in the prediction model helps us
establish results that share aspects of both the PAC and prediction models.

We start by explaining the aggregating algorithm due to Vovk ([8]). After
reviewing Vovk’s work, we give an on-line algorithm that finds the best pruning
through a dynamic programming scheme, and then present an on-line algorithm
that is competitive not only with the best pruning but also with the best
prediction values. Finally we note that the latter algorithm is so simple that it can be
generalized to the case where, instead of using decision trees, data are classified
in some arbitrarily fixed manner.



M.M. Richter et al. (Eds.): ALT’98, LNAI 1501, pp. 127- IT^ 1998. 
(c) Springer-Verlag Berlin Heidelberg 1998
2 Online Prediction Model



In the most primitive version of the prediction algorithm, which is called the
weighted majority algorithm, the master algorithm produces its output based
on the weighted majority vote of the experts. In this algorithm, each of
the $N$ experts has initial weight 1, and in each trial the weights are multiplied
by $0 < \beta < 1$ in the case of a mistake and are left unchanged otherwise.
The weighted majority prediction model can be generalized to the case where
it is allowed to hedge in predictions: the master algorithm and the experts
are allowed to output values in [0, 1] rather than binary values 0 or 1. In this
paper we adopt the generalized prediction model described as follows. Let the
prediction space, denoted $\hat{Y}$, be [0, 1], and let the outcome space, denoted $Y$, be
{0, 1}. And let the instance space be denoted by $X$. A prediction algorithm has
the pool of experts denoted by $\mathcal{E} = \{\mathcal{E}_1, \ldots, \mathcal{E}_N\}$. At each trial $t = 1, 2, \ldots$ a
prediction algorithm $A$ receives an instance $x_t \in X$ and generates a prediction
$\hat{y}_t \in [0, 1]$. Likewise, at each trial $t$, every expert $\mathcal{E}_i$ makes a prediction $\xi_t^i \in \hat{Y}$ for
the instance $x_t$, and sends its prediction to the master algorithm. The algorithm
somehow combines these predictions in order to make its own prediction $\hat{y}_t \in \hat{Y}$.
After an outcome $y_t \in \{0, 1\}$ is observed (which can be thought of as the correct
classification of $x_t$), the master algorithm and the experts suffer loss given in
terms of a loss function denoted $\lambda$: the master algorithm suffers loss $\lambda(y_t, \hat{y}_t)$,
whereas the $i$th expert suffers loss $\lambda(y_t, \xi_t^i)$. A typical example of such a loss
function is given as $\lambda(y_t, \hat{y}_t) = |y_t - \hat{y}_t|$, which is called the absolute loss function.
In the following arguments we will see that the sequence of instances $x_t$
does not in fact play any essential role. Usually we start our argument assuming an
arbitrarily given sequence of outcomes.

The prediction $\hat{y}_t \in [0, 1]$ can typically be interpreted as follows: the
algorithm predicts $y_t = 1$ with probability $\hat{y}_t$ and $y_t = 0$ with probability $1 - \hat{y}_t$. The
cumulative loss of a prediction algorithm $A$ and that of the $i$th expert over $T$
trials are given by $L_A(y) = \sum_{t=1}^{T} \lambda(y_t, \hat{y}_t)$ and $L_i(y) = \sum_{t=1}^{T} \lambda(y_t, \xi_t^i)$ for outcome
sequence $y = (y_1, \ldots, y_T)$, respectively. The goal of the prediction algorithm $A$
is to minimize the cumulative loss $L_A(y) = \sum_{t=1}^{T} \lambda(y_t, \hat{y}_t)$ for an arbitrary outcome
sequence $y = (y_1, \ldots, y_T)$, $T \geq 1$. The cumulative loss is simply called the loss
of $A$.

We begin by reviewing Vovk’s on-line prediction algorithms called the
aggregating algorithm and the aggregating pseudo-algorithm. In the on-line
prediction model, the prediction algorithm maintains a weight $w_t^i \in [0, 1]$ for each
expert $\mathcal{E}_i$ that reflects the actual performance of the expert $\mathcal{E}_i$ up to time $t$. For
simplicity, we assume that the weight is set to $w_1^i = 1$ for $1 \leq i \leq N$ at time
$t = 1$. It is crucial how to update the weights after time $t = 1$. In order to
specify the update rule we introduce the parameter $\beta \in (0, 1)$ of the algorithm
called the exponential learning rate. At each trial $t$, after receiving the predictions
$\xi_t^1, \ldots, \xi_t^N$ from the experts, the master algorithm computes, for each $y \in Y$, the
function $r(y)$ which is specified by

$\beta^{r(y)} = \sum_{i=1}^{N} \bar{w}_t^i\, \beta^{\lambda(y, \xi_t^i)}$

or equivalently

$r(y) = \log_\beta \sum_{i=1}^{N} \bar{w}_t^i\, \beta^{\lambda(y, \xi_t^i)}$

where $\bar{w}_t^i$ are the normalized weights:

$\bar{w}_t^i = \frac{w_t^i}{\sum_{j=1}^{N} w_t^j}$

By definition, for each $y \in Y$, $r(y)$ gives the weighted average of losses of
the experts. In other words, the function $r$ does not provide information as to
which outcome in $Y$ is more likely to happen, but gives the weighted average of
loss for both cases of $y_t = 0$ and $y_t = 1$. After receiving the correct classification
$y_t \in Y$, the algorithm updates the weights of the experts according to the rule

$w_{t+1}^i = w_t^i\, \beta^{\lambda(y_t, \xi_t^i)}$

for $1 \leq i \leq N$ and $1 \leq t \leq T$. So the larger the loss of expert $\mathcal{E}_i$ is, the more
its weight decreases. It will be seen later that, for any outcome sequence $y =
(y_1, \ldots, y_T)$, $\sum_{t=1}^{T} r_t(y_t)$ is bounded from above in terms of each expert's loss
$L_i(y) = \sum_{t=1}^{T} \lambda(y_t, \xi_t^i)$, where $r_t$ is the weighted average of losses of the experts
at time $t$. So in order to bound from above the loss of the prediction algorithm
$L_A(y) = \sum_{t=1}^{T} \lambda(y_t, \hat{y}_t)$ in terms of the loss of the best expert, we want to have an
inequality of the form $\lambda(y_t, \hat{y}_t) \leq c\, r_t(y_t)$ that holds for any $y_t \in Y$ when some
$\hat{y}_t \in \hat{Y}$ is chosen appropriately, where $c$ is some constant. With the arguments
above in mind, we define a $\beta$-mixture as the function $r : Y \to [0, \infty)$ defined as

$r(y) = \log_\beta \int_{\hat{Y}} \beta^{\lambda(y, \xi)}\, dP(\xi)$

for a probability distribution $P$ over $\hat{Y}$. Let $c(\beta)$ be the infimum of the real values
$c$ such that for any $\beta$-mixture $r$ there exists $\hat{y} \in \hat{Y}$ such that $\lambda(y, \hat{y}) \leq c\, r(y)$ for
any $y \in Y$. The function $c(\beta)$ will be called the mixability curve. Throughout
the paper we assume that the infimum in defining $c(\beta)$ is attained. Sufficient
conditions for $c(\beta)$ to exist are given in [8]. By definition there exists a function,
denoted $\Sigma_\beta$, from $\beta$-mixtures to $\hat{Y}$ such that

$\lambda(y, \Sigma_\beta(r)) \leq c(\beta)\, r(y)$

for any $\beta$-mixture $r$ and any $y$. The function $\Sigma_\beta$ will be called a substitution
function. Note that $\Sigma_\beta(r)$ depends not only on the function $r$ and the constant $\beta$, but also on
the underlying loss function $\lambda$. The function $r$ which gives the average loss of the
experts for each outcome will be called a pseudoprediction, whereas a prediction
(denoted $\hat{y}$ so far) given by an element in $\hat{Y}$ will be called a genuine prediction
when we need to make it explicit that the prediction is not a pseudoprediction.
The first type of prediction algorithm yields a pseudoprediction and is called
an Aggregating Pseudo-Algorithm (APA for short), whereas the second type of
algorithm produces a genuine prediction and is called an Aggregating Algorithm
(AA for short).

The constant $c(\beta)$ and the substitution function $\Sigma_\beta(r)$ have been obtained
for popular loss functions such as the absolute loss, the square loss and the
log loss functions [8]. In particular, when we consider the absolute loss function
$\lambda(y, \hat{y}) = |y - \hat{y}|$ for $y \in Y = \{0, 1\}$, $\hat{y} \in \hat{Y} = [0, 1]$, it was shown [8] that

$c(\beta) = \frac{\ln(1/\beta)}{2 \ln(2/(1 + \beta))}$

and

$\Sigma_\beta(r) = I_0^1\left(\frac{1}{2} + c(\beta)\, \frac{r(0) - r(1)}{2}\right)$

where

$I_0^1(t) = 1$ if $t > 1$, $I_0^1(t) = 0$ if $t < 0$, and $I_0^1(t) = t$ otherwise.

Note that the absolute loss $|y - \hat{y}|$ is exactly the probability that the probabilistically
predicted bit differs from the true outcome $y$.
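
The defining property of the substitution function can be checked numerically for the absolute loss. The sketch below is an illustration only: it assumes the clipped form $\frac{1}{2} + c(\beta)(r(0) - r(1))/2$ for the substitution function, finitely many experts with genuine predictions in [0, 1], and the hypothetical function names `c_abs`, `mixture` and `substitution_abs`.

```python
import math


def c_abs(beta):
    """Mixability constant for the absolute loss."""
    return math.log(1 / beta) / (2 * math.log(2 / (1 + beta)))


def mixture(beta, preds, weights, y):
    """r(y) = log_beta sum_i w_i * beta^{|y - xi_i|} with normalized weights."""
    total = sum(weights)
    avg = sum(w / total * beta ** abs(y - xi) for w, xi in zip(weights, preds))
    return math.log(avg) / math.log(beta)


def substitution_abs(r0, r1, beta):
    """Clip 1/2 + c(beta) * (r(0) - r(1)) / 2 to the interval [0, 1]."""
    t = 0.5 + c_abs(beta) * (r0 - r1) / 2
    return min(1.0, max(0.0, t))


beta = 0.5
preds = [0.2, 0.7, 1.0]    # hypothetical expert predictions
weights = [1.0, 2.0, 3.0]  # hypothetical unnormalized weights
r0 = mixture(beta, preds, weights, 0)
r1 = mixture(beta, preds, weights, 1)
y_hat = substitution_abs(r0, r1, beta)
```

Whatever the outcome turns out to be, the loss of `y_hat` is at most `c_abs(beta)` times the corresponding value of the pseudoprediction, which is the inequality the substitution function is required to satisfy.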

We consider that our prediction algorithms consist of two parts: one keeps
track of $r_t(y)$, i.e., the average loss of the experts for each case of $y_t = 0$ and
$y_t = 1$, which is computed provided that the actual outcomes up to $t - 1$, and hence
the actual weights up to $t - 1$, are known for the $N$ experts; the other makes a
prediction on the outcome at time $t$ by applying the substitution function $\Sigma_\beta$ that
selects a good prediction based on the average $r_t(y)$ of losses over the experts
when the outcome is $y$. We can say that the prediction at time $t$ is made by fully
exploiting the information the algorithm can get before the outcome at time $t$
is revealed.

For ease of exposition we first give the prediction algorithm that consists
of only the first part, and then present the algorithm that consists of both
parts. As shown in the next section, the first type of algorithm will be useful
when we do not need to produce a prediction in every trial. When we deal with
the aggregating pseudo-algorithm, it is convenient to assume that not only the
master algorithm but also the subordinate experts output pseudopredictions: for
$y \in Y$, a pseudoprediction $\xi$ takes the real value $\xi(y)$, which is interpreted as the
loss of $\xi$ for outcome $y$. We use the same symbol $\xi$ to represent a
genuine prediction $\xi \in \hat{Y}$ as well. A genuine prediction $\xi \in \hat{Y}$ can also be viewed
as the pseudoprediction $\xi' : Y \to [0, \infty)$ defined as $\xi'(y) = \lambda(y, \xi)$ for $y \in Y$.
So the master algorithm computes the pseudoprediction based on the equation
obtained by replacing $\lambda(y, \xi_t^i)$ in the equation $r(y) = \log_\beta \sum_{i=1}^{N} \bar{w}_t^i\, \beta^{\lambda(y, \xi_t^i)}$ by $\xi_t^i(y)$:

$r(y) = \log_\beta \sum_{i=1}^{N} \bar{w}_t^i\, \beta^{\xi_t^i(y)}$

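The complete description of the algorithm APA due to Vovk ([8]) is given as
follows.

Algorithm 1 (APA($\beta$))
for $i \in \{1, \ldots, N\}$ do
    $w_1^i := 1$
for $t := 1, 2, \ldots$ do
    receive $(\xi_t^1, \ldots, \xi_t^N)$
    for $y \in Y$ do
        $r_t(y) := \log_\beta \sum_i \bar{w}_t^i\, \beta^{\xi_t^i(y)}$
    output $r_t$
    receive $y_t$
    for $i \in \{1, \ldots, N\}$ do
        $w_{t+1}^i := w_t^i\, \beta^{\xi_t^i(y_t)}$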
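The loss of APA($\beta$) and that of the expert $\mathcal{E}_i$ for $y = (y_1, \ldots, y_T)$ are given
by

$L_{APA(\beta)}(y) = \sum_{t=1}^{T} r_t(y_t)$

and

$L_i(y) = \sum_{t=1}^{T} \xi_t^i(y_t)$

respectively. As described above, the loss for $y = (y_1, \ldots, y_T)$ is defined to be just
the sum of the weighted average of losses for the outcomes $y_t$ taken in trials. For
an arbitrarily fixed outcome sequence $y = (y_1, \ldots, y_T) \in Y^*$, we have

$L_{APA(\beta)}(y) = \sum_{t=1}^{T} r_t(y_t)$

$= \sum_{t=1}^{T} \log_\beta \frac{\sum_{i=1}^{N} w_t^i\, \beta^{\xi_t^i(y_t)}}{\sum_{i=1}^{N} w_t^i}$

$= \sum_{t=1}^{T} \log_\beta \frac{\sum_{i=1}^{N} w_{t+1}^i}{\sum_{i=1}^{N} w_t^i}$

$= \log_\beta \left(\sum_{i=1}^{N} w_{T+1}^i\right) - \log_\beta \left(\sum_{i=1}^{N} w_1^i\right)$

$= \log_\beta \frac{1}{N} \sum_{i=1}^{N} \beta^{L_i(y)}$

Since $\beta < 1$, we have

$L_{APA(\beta)}(y) \leq \log_\beta \left(\frac{1}{N}\, \beta^{L_i(y)}\right) = L_i(y) + \frac{\ln N}{\ln(1/\beta)}$

for any $1 \leq i \leq N$, which implies the following theorem.

Theorem 1. (Vovk, [8]) Let $0 < \beta < 1$. Then, for any $N \geq 1$, any $N$ experts $\mathcal{E}$
and for any $y \in Y^*$,

$L_{APA(\beta)}(y) = \sum_{t=1}^{T} r_t(y_t) \leq \min_{1 \leq i \leq N} \left(L_i(y) + \frac{\ln N}{\ln(1/\beta)}\right)$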
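
The telescoping argument behind Theorem 1 can be replayed on a short sequence. The sketch below is a minimal illustration, assuming that only the losses of the experts on the realized outcomes are recorded (which is all the cumulative loss depends on); the function name `apa_loss` and the loss table are hypothetical.

```python
import math


def apa_loss(expert_losses, beta):
    """expert_losses[t][i] is the loss of expert i at trial t on the realized outcome.

    Returns the cumulative APA loss and the cumulative loss of each expert.
    """
    n = len(expert_losses[0])
    weights = [1.0] * n
    total = 0.0
    cumulative = [0.0] * n
    for losses in expert_losses:
        norm = sum(weights)
        avg = sum(w / norm * beta ** l for w, l in zip(weights, losses))
        total += math.log(avg) / math.log(beta)  # r_t(y_t)
        weights = [w * beta ** l for w, l in zip(weights, losses)]
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return total, cumulative


beta = 0.5
losses = [
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
]
apa_total, expert_totals = apa_loss(losses, beta)
closed_form = math.log(sum(beta ** l for l in expert_totals) / len(expert_totals)) / math.log(beta)
bound = min(expert_totals) + math.log(len(expert_totals)) / math.log(1 / beta)
```

On this example the running sum agrees with the closed form up to rounding, and it stays below the bound of Theorem 1.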

When the algorithm is required to make a prediction on an outcome
in every trial, the algorithm computes $\Sigma_\beta(r_t)$ by using the pseudoprediction
$r_t$ and yields a genuine prediction. In this case the $i$th expert is also supposed
to produce a genuine prediction, denoted $\xi_t^i$ for $1 \leq i \leq N$. In this way we
obtain the aggregating algorithm ([8]) by replacing $\xi_t^i(y)$ in APA with $\lambda(y, \xi_t^i)$,
and replacing output $r_t$ in APA with output $\hat{y}_t := \Sigma_\beta(r_t)$.



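Algorithm 2 (AA($\beta$))
for $i \in \{1, \ldots, N\}$ do
    $w_1^i := 1/N$
for $t := 1, 2, \ldots$ do
    receive $(\xi_t^1, \ldots, \xi_t^N)$
    for $y \in Y$ do
        $r_t(y) := \log_\beta \sum_i \bar{w}_t^i\, \beta^{\lambda(y, \xi_t^i)}$
    output $\hat{y}_t := \Sigma_\beta(r_t)$
    receive $y_t$
    for $i \in \{1, \ldots, N\}$ do
        $w_{t+1}^i := w_t^i\, \beta^{\lambda(y_t, \xi_t^i)}$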


The loss of the aggregating algorithm, denoted $L_{AA(\beta)}(y)$, is defined as

$L_{AA(\beta)}(y) = \sum_{t=1}^{T} \lambda(y_t, \Sigma_\beta(r_t))$.

By the definitions of the mixability curve $c(\beta)$ and the substitution
function $\Sigma_\beta$, we immediately have an upper bound on the loss of the aggregating
algorithm as follows.

$L_{AA(\beta)}(y) = \sum_{t=1}^{T} \lambda(y_t, \Sigma_\beta(r_t))$

$\leq c(\beta) \sum_{t=1}^{T} r_t(y_t)$

$= c(\beta)\, L_{APA(\beta)}(y)$,

which, together with Theorem 1, establishes the following theorem. The
above inequality follows from the fact that $\lambda(y_t, \Sigma_\beta(r_t)) \leq c(\beta)\, r_t(y_t)$, which
holds by definition because the pseudoprediction $r_t$ is easily shown to be a
$\beta$-mixture.

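Theorem 2. (Vovk, [8]) Let $0 < \beta < 1$, and let $c(\beta)$ give the value of the
mixability curve at $\beta$. Then, for any $N \geq 1$, any $N$ experts $\mathcal{E}$ and for any
$y \in Y^*$,

$L_{AA(\beta)}(y) = \sum_{t=1}^{T} \lambda(y_t, \Sigma_\beta(r_t)) \leq \min_{1 \leq i \leq N} \left(c(\beta)\, L_i(y) + \frac{c(\beta) \ln N}{\ln(1/\beta)}\right)$.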
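
Algorithm 2 can be sketched end to end for the absolute loss. This is a hedged illustration rather than a reference implementation: it assumes the clipped substitution function for the absolute loss, a short hand-made sequence of expert predictions and outcomes, and the hypothetical names `c_abs` and `aa_absolute`.

```python
import math


def c_abs(beta):
    """Mixability constant for the absolute loss."""
    return math.log(1 / beta) / (2 * math.log(2 / (1 + beta)))


def aa_absolute(expert_preds, outcomes, beta):
    """Run the aggregating algorithm with absolute loss.

    Returns the master's cumulative loss and the cumulative loss of each expert.
    """
    n = len(expert_preds[0])
    weights = [1.0 / n] * n
    c = c_abs(beta)
    master_loss = 0.0
    expert_loss = [0.0] * n
    for preds, y in zip(expert_preds, outcomes):
        norm = sum(weights)
        # Pseudoprediction r_t(v) for both possible outcomes v in {0, 1}.
        r = [math.log(sum(w / norm * beta ** abs(v - p)
                          for w, p in zip(weights, preds)), beta)
             for v in (0, 1)]
        y_hat = min(1.0, max(0.0, 0.5 + c * (r[0] - r[1]) / 2))
        master_loss += abs(y - y_hat)
        weights = [w * beta ** abs(y - p) for w, p in zip(weights, preds)]
        expert_loss = [l + abs(y - p) for l, p in zip(expert_loss, preds)]
    return master_loss, expert_loss


beta = 0.5
expert_preds = [[0.0, 1.0, 0.5], [1.0, 1.0, 0.5], [0.0, 0.2, 0.5],
                [1.0, 0.9, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.5]]
outcomes = [1, 1, 0, 1, 1, 1]
master_loss, expert_loss = aa_absolute(expert_preds, outcomes, beta)
bound = c_abs(beta) * (min(expert_loss)
                       + math.log(len(expert_loss)) / math.log(1 / beta))
```

Smaller values of $\beta$ make the weights react faster to mistakes, at the price of a larger multiplicative constant $c(\beta)$ in front of the best expert's loss.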



3 Applying the Aggregating Algorithm to Prune Decision 
Trees 

One of the many applications of the multiplicative weight-update prediction
algorithm is the problem of finding a “good” pruning of a given decision
tree $\mathcal{T}$. By a good pruning we mean a pruning that is “not much worse” than
the best pruning of a given decision tree. When given a label function, denoted
$V$, which associates each node of $\mathcal{T}$ with an output in $\hat{Y}$, a pruned decision tree
$\mathcal{P}$ can be naturally thought of as an expert who makes predictions. So we can
enumerate all the pruned trees of a given decision tree and apply the AA with
the pruned trees being taken as experts.

By simply applying the AA to this collection of experts, it follows from
Theorem 2 that the loss of this naive prediction algorithm, denoted $A_N$, is at
most

$L_{A_N}(y) \leq \min_{\mathcal{P} \in PRUN(\mathcal{T})} c(\beta)\, L_{\mathcal{P}, V}(y) + \frac{2 c(\beta)\, |leaves(\mathcal{T})| \ln 2}{\ln(1/\beta)}$

for any $y \in Y^*$, where $L_{\mathcal{P}, V}(y)$ denotes the loss of the pruning $\mathcal{P}$ when given
$V$. Unfortunately this naive approach has a fatal drawback: we have to deal
with the exponential number of prunings of $\mathcal{T}$.
Elaborating a data structure, Helmbold and Schapire constructed an efficient
prediction algorithm for the absolute loss that works in a manner equivalent to 
the naive algorithm without enumerating prunings. The performance of their 
algorithm is given as follows: 



Theorem 3. (Helmbold and Schapire, [3]) In the case of the absolute loss, there
exists a prediction algorithm $A$ such that for any $\mathcal{T}$, $V$ and $y \in \{0, 1\}^*$, when
given $\mathcal{T}$ and $V$ as input, $A$ makes predictions for $y$ so that the loss is at most

$$L_A(y) \le \min_{\mathcal{P} \in \mathrm{PRUN}(\mathcal{T})}
  \frac{L_{\mathcal{P},V}(y) \ln(1/\beta) + |\mathcal{P}| \ln 2}{2 \ln(2/(1 + \beta))},$$

where $|\mathcal{P}|$ denotes the number of nodes of $\mathcal{P}$ minus $|\mathrm{leaves}(\mathcal{T}) \cap \mathrm{leaves}(\mathcal{P})|$. $A$
generates a prediction at each trial $t$ in time $O(|x_t|)$. Moreover, the label function
$V$ may depend on $t$.



The naive algorithm mentioned above considers each pruning of a decision
tree as an expert. There are, however, various choices as to what we regard as the
experts. In the next section we consider as the experts agencies corresponding
to blocks of a partition of the domain $X$, each making some fixed prediction
for a block. In this section, we consider the more complicated case in which two
kinds of mini-experts $\mathcal{E}_u = \{\mathcal{E}_{u\perp}, \mathcal{E}_{u\downarrow}\}$ are put on each internal node $u$ of a tree,
where one makes the decision to throw away the subtree below the node $u$
and the other makes the decision to keep the edges downward from $u$. Then
choosing one of the two mini-experts at each internal node clearly amounts to
specifying a subtree of the tree given first. So, putting the APA at each inner
node $u$ and applying the APA recursively, we may compute a weighted subtree
that has nearly the best performance.

To be more precise, let us introduce some notation about decision trees. Let
$\Sigma$ be a finite alphabet, $|\Sigma| > 1$. A template tree $\mathcal{T}$ over $\Sigma$ is a rooted, $|\Sigma|$-ary
tree. Thus we can identify each node of $\mathcal{T}$ with the sequence of symbols in $\Sigma$
that forms a path from the root to that node. In particular, if a node $u$ of $\mathcal{T}$
is represented by $x \in \Sigma^*$ (or a prefix of $x$), then we say that $x$ reaches the
node $u$. The leaf $l$ that $x$ reaches is denoted by $l = \mathrm{leaf}_{\mathcal{T}}(x)$. Given a tree $\mathcal{T}$, the
set of its nodes and that of its leaves are denoted by $\mathrm{nodes}(\mathcal{T})$ and $\mathrm{leaves}(\mathcal{T})$,
respectively. A string in $\Sigma^*$ that reaches any of the leaves of $\mathcal{T}$ is called an
instance.

As in the usual setting, an instance induces a path from the root to a leaf
according to the outcomes of classification tests done at the internal nodes on
the path. So, without loss of generality, we can identify an instance with the
path it induces, and thus we do not need to explicitly specify classification rules
at the internal nodes of $\mathcal{T}$. A label function $V$ for template tree $\mathcal{T}$ is a function
that maps the set of nodes of $\mathcal{T}$ to the prediction space $\hat{Y}$. A pruning $\mathcal{P}$ of the
template tree $\mathcal{T}$ is a tree obtained by replacing zero or more of the internal nodes
(and associated subtrees) of $\mathcal{T}$ by leaves. Note that $\mathcal{T}$ itself is a pruning of $\mathcal{T}$ as
well. The pair $(\mathcal{P}, V)$ induces a pruned decision tree that makes its prediction
$V(\mathrm{leaf}_{\mathcal{P}}(x))$ for instance $x$. The set of all prunings of $\mathcal{T}$ is denoted by $\mathrm{PRUN}(\mathcal{T})$.
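To make these definitions concrete, here is a small Python sketch. It assumes a hypothetical encoding in which a template tree is given by its set of leaves (strings over the alphabet), every internal node has all $|\Sigma|$ children, and a label function is a dictionary from node strings to predictions. It counts the prunings of a template tree, which shows how quickly the naive set of experts grows, and evaluates the prediction of a pruned tree.

```python
def count_prunings(leaves, u="", alphabet="01"):
    """Number of prunings of the subtree of the template tree rooted at node u.

    Assumes every internal node has one child per alphabet symbol.
    """
    if u in leaves:
        return 1
    product = 1
    for a in alphabet:
        product *= count_prunings(leaves, u + a, alphabet)
    # either u becomes a leaf, or u stays internal and each child subtree is pruned
    return 1 + product

def predict(pruning_leaves, label, x):
    """Prediction V(leaf_P(x)) of the pruned tree (P, V) for instance x."""
    for k in range(len(x) + 1):
        if x[:k] in pruning_leaves:
            return label[x[:k]]
    raise ValueError("x does not reach a leaf of the pruning")
```

For a complete binary template tree the count satisfies $c_0 = 1$ and $c_{d+1} = 1 + c_d^2$, so it grows doubly exponentially in the depth.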
Figure 1 shows an example of a template tree $\mathcal{T}$ over alphabet $\Sigma = \{0, 1\}$ and
a pruning $\mathcal{P}$ of $\mathcal{T}$ with $\hat{Y} = [0, 1]$. The numbers associated with the nodes are
the values of the label function $V$ for the nodes. For example, the predictions of
$(\mathcal{T}, V)$ and $(\mathcal{P}, V)$ for instance (101) are 0.6 and 0.2, respectively.




Fig. 1. Examples of a template tree T and a pruning P of T with a label function. 



We shall explore how to use the APA of the previous section to construct an
algorithm that seeks a weighted combination of the mini-experts located at
the inner nodes of a given decision tree, so that the weighted combination of the
mini-experts performs nearly as well as the best pruning of the decision tree.

Our algorithm is in some sense a quite straightforward implementation of
dynamic programming. To explain how it works, we still need some more
notation. Recall that each node of $\mathcal{T}$ is identified with the string in $\Sigma^*$ that
forms the path from the root to that node. In particular, the root is specified
by the empty string $\epsilon$. For node $u \in \Sigma^*$, let $\mathcal{T}_u$ denote the subtree of $\mathcal{T}$ rooted
at $u$. For an outcome sequence $y \in Y^*$, the loss suffered at $u$, denoted $L_u(y)$, is
defined as follows:

$$L_u(y) = \sum_{t \,:\, x_t \text{ reaches } u} \lambda(y_t, V(u)).$$

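Then, for any pruning $\mathcal{P}_u$ of $\mathcal{T}_u$, the loss suffered by $\mathcal{P}_u$, denoted $L_{\mathcal{P}_u}(y)$, can
be represented by the sum of $L_l(y)$ for all leaves $l$ of $\mathcal{P}_u$. In other words, we can
write $L_{\mathcal{P}_u}(y) = L_u(y)$ if $\mathcal{P}_u$ consists of a single leaf $u$ and $L_{\mathcal{P}_u}(y) = \sum_{a \in \Sigma} L_{\mathcal{P}_{ua}}(y)$
otherwise. Here $\mathcal{P}_{ua}$ is the subtree of $\mathcal{P}_u$ rooted at $ua$. Since the losses $L_{\mathcal{P}_{ua}}(y)$
for $a \in \Sigma$ are independent of each other, we can minimize the loss $L_{\mathcal{P}_u}(y)$ by
minimizing each loss $L_{\mathcal{P}_{ua}}(y)$ independently. We therefore have for any internal
node $u$ of $\mathcal{T}$

$$\min_{\mathcal{P}_u \in \mathrm{PRUN}(\mathcal{T}_u)} L_{\mathcal{P}_u}(y)
  = \min\left\{ L_u(y),\ \sum_{a \in \Sigma} \min_{\mathcal{P}_{ua} \in \mathrm{PRUN}(\mathcal{T}_{ua})} L_{\mathcal{P}_{ua}}(y) \right\}.$$

Since dynamic programming can be applied to solve a minimization problem
of this type, we can efficiently compute $\mathcal{P}_\epsilon \in \mathrm{PRUN}(\mathcal{T}_\epsilon)$ that minimizes $L_{\mathcal{P}_\epsilon}(y)$,
which is the best pruning of $\mathcal{T}$. But if we try to solve the minimization
problem based on the formula above in a straightforward way, we have to have the
sequence of outcomes $y = (y_1, \ldots, y_T)$ ahead of time.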
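The off-line recursion can be sketched in a few lines of Python. This is only an illustration under assumed conventions (the tree is given by its set of leaf strings, the label function is a dictionary, and the function names are hypothetical); it needs the whole outcome sequence in advance, which is exactly the limitation noted above.

```python
def node_losses(leaves, label, instances, outcomes, loss):
    """L_u(y): total loss of predicting V(u) on every trial whose instance reaches u."""
    L = {}
    for x, y in zip(instances, outcomes):
        for k in range(len(x) + 1):
            u = x[:k]
            L[u] = L.get(u, 0.0) + loss(y, label[u])
            if u in leaves:
                break          # x reaches no node below its leaf
    return L

def best_pruning_loss(leaves, L, u="", alphabet="01"):
    """Minimum over prunings P_u of T_u of L_{P_u}(y).

    Implements min{ L_u(y), sum over a of the best loss in the subtree at ua }.
    """
    here = L.get(u, 0.0)
    if u in leaves:
        return here
    below = sum(best_pruning_loss(leaves, L, u + a, alphabet) for a in alphabet)
    return min(here, below)
```

Each node is visited once, so the recursion runs in time linear in the number of nodes once the per-node losses are known.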

In the rest of this section we construct an algorithm that solves the
minimization problem in an on-line fashion by applying the aggregating
pseudo-algorithm recursively on the decision tree. As mentioned before, we associate
two mini-experts $\mathcal{E}_u = \{\mathcal{E}_{u\perp}, \mathcal{E}_{u\downarrow}\}$ with each internal node $u$, one, $\mathcal{E}_{u\perp}$,
corresponding to making the node $u$ a leaf and the other, $\mathcal{E}_{u\downarrow}$, corresponding to making
the node $u$ an internal node. To combine these experts we apply the APA
recursively, placing one on each inner node of $\mathcal{T}$: the APA at an inner node $u$,
denoted APA$_u(\beta)$, combines the pseudopredictions of the experts $\mathcal{E}_{u\perp}$ and $\mathcal{E}_{u\downarrow}$
to obtain its own pseudoprediction $r_t^u$, and passes it to the APA at the parent
node of $u$. More precisely, when given an instance $x_t$ that goes through $u$ and
$ua$, the first expert $\mathcal{E}_{u\perp}$ generates $V(u)$ and the second expert $\mathcal{E}_{u\downarrow}$ generates $r_t^{ua}$,
i.e., the pseudoprediction made by APA$_{ua}(\beta)$, the APA at node $ua$. Then,
taking the weighted average of these pseudopredictions $V(u)$ and $r_t^{ua}$ according
to the multiplicative weight-update rule (recall that the genuine prediction $V(u)$
is regarded as a pseudoprediction), APA$_u(\beta)$ obtains the pseudoprediction $r_t^u$ at
$u$. To obtain the genuine prediction $\hat{y}_t$, our algorithm applies the $\beta$-substitution
function to the pseudoprediction only at the root during every trial, that is,
$\hat{y}_t = \Sigma_\beta(r_t^\epsilon)$; in the internal nodes we combine not genuine predictions but
pseudopredictions using the APA.

We present below the prediction algorithm constructed this way, which we
call the Structured Weight-based Prediction algorithm (SWP($\beta$) for short). Here,
$\mathrm{path}(x_t)$ denotes the set of the nodes of $\mathcal{T}$ that $x_t$ reaches. In other words,
$\mathrm{path}(x_t)$ is the set of the prefixes of $x_t$. For node $u$, $|u|$ denotes the depth of $u$,
i.e., the length of the path from the root to $u$.

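Algorithm 3 (APA$_u(\beta)$)
  procedure PSEUDOPRED($u$, $x_t$)
    if $u \in \mathrm{leaves}(\mathcal{T})$ then
      for $y \in Y$ do
        $r_t^u(y) := \lambda(y, V(u))$
    else
      choose $a \in \Sigma$ such that $ua \in \mathrm{path}(x_t)$
      $r_t^{ua}$ := PSEUDOPRED($ua$, $x_t$)
      for $y \in Y$ do
        $r_t^u(y) := \log_\beta \left( w_t^{u\perp} \beta^{\lambda(y, V(u))} + w_t^{u\downarrow} \beta^{r_t^{ua}(y)} \right)$
    return $r_t^u$

  procedure UPDATE($u$, $y_t$)
    if $u \in \mathrm{leaves}(\mathcal{T})$ then
      return
    else
      choose $a \in \Sigma$ such that $ua \in \mathrm{path}(x_t)$
      $w_{t+1}^{u\perp} := w_t^{u\perp} \beta^{\lambda(y_t, V(u))} / \beta^{r_t^u(y_t)}$
      $w_{t+1}^{u\downarrow} := w_t^{u\downarrow} \beta^{r_t^{ua}(y_t)} / \beta^{r_t^u(y_t)}$
      return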

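Algorithm 4 (algorithm SWP($\beta$))
  for $u \in \mathrm{nodes}(\mathcal{T}) \setminus \mathrm{leaves}(\mathcal{T})$ do
    $w_1^{u\perp} := 1/2$
    $w_1^{u\downarrow} := 1/2$
  for $t = 1, 2, \ldots$ do
    receive $x_t$
    $r_t^\epsilon$ := PSEUDOPRED($\epsilon$, $x_t$)
    $\hat{y}_t := \Sigma_\beta(r_t^\epsilon)$
    output $\hat{y}_t$
    receive $y_t$
    for $u \in \mathrm{path}(x_t)$ do
      UPDATE($u$, $y_t$)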
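The following Python sketch shows one way the recursive scheme described above could be realised for the absolute loss with outcomes in {0, 1}. It is a hedged illustration, not the authors' code: the tree encoding (a set of leaf strings), the function name, the iterative bottom-up pass in place of the recursive PSEUDOPRED, and the clipped-midpoint substitution rule are all assumptions. Only the weight of the "make $u$ a leaf" mini-expert is stored, since the two weights at a node sum to one after each update.

```python
import math

def swp_absolute_loss(leaves, label, instances, outcomes, beta):
    """Cumulative absolute loss of an SWP(beta)-style predictor (sketch)."""
    c = math.log(1 / beta) / (2 * math.log(2 / (1 + beta)))
    w_leaf = {}   # weight of the "make u a leaf" mini-expert at internal node u
    total = 0.0
    for x, y in zip(instances, outcomes):
        # path(x_t): prefixes of x_t from the root down to the leaf it reaches
        path = []
        for k in range(len(x) + 1):
            path.append(x[:k])
            if x[:k] in leaves:
                break
        # pseudopredictions, bottom-up: r[k] = (r(0), r(1)) at node path[k]
        r = [None] * len(path)
        for k in range(len(path) - 1, -1, -1):
            u = path[k]
            own = [abs(z - label[u]) for z in (0, 1)]   # V(u) seen as a pseudoprediction
            if u in leaves:
                r[k] = own
            else:
                wl = w_leaf.setdefault(u, 0.5)
                r[k] = [math.log(wl * beta ** own[z]
                                 + (1 - wl) * beta ** r[k + 1][z], beta)
                        for z in (0, 1)]
        # substitution is applied at the root only (assumed rule for absolute loss)
        y_hat = min(1.0, max(0.0, 0.5 + c * (r[0][0] - r[0][1]) / 2))
        total += abs(y - y_hat)
        # weight update at every internal node on the path
        for k in range(len(path) - 1):
            u = path[k]
            w_leaf[u] *= beta ** abs(y - label[u]) / beta ** r[k][y]
    return total
```

Each trial touches only the nodes on the path of $x_t$, so the work per trial is proportional to the depth of the leaf reached. The sketch assumes every instance reaches a leaf and every node on its path has a label.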

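Let the loss suffered by APA$_u(\beta)$ be denoted by $\hat{L}_u(y)$. That is,

$$\hat{L}_u(y) = \sum_{t \,:\, u \in \mathrm{path}(x_t)} r_t^u(y_t).$$

Since the first expert $\mathcal{E}_{u\perp}$ suffers the loss $L_u(y)$ and the second expert $\mathcal{E}_{u\downarrow}$
suffers the loss $\sum_{a \in \Sigma} \hat{L}_{ua}(y)$, Theorem 2 says that for any internal node $u$ of $\mathcal{T}$,

$$\hat{L}_u(y) \le \min\left\{ L_u(y),\ \sum_{a \in \Sigma} \hat{L}_{ua}(y) \right\} + (\ln 2)/(\ln(1/\beta)).$$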



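By the similarity between the inequality above and the equation

$$\min_{\mathcal{P}_u \in \mathrm{PRUN}(\mathcal{T}_u)} L_{\mathcal{P}_u}(y)
  = \min\left\{ L_u(y),\ \sum_{a \in \Sigma} \min_{\mathcal{P}_{ua} \in \mathrm{PRUN}(\mathcal{T}_{ua})} L_{\mathcal{P}_{ua}}(y) \right\}$$

for the minimization problem, we can roughly say that $\hat{L}_u(y)$ is not much larger
than the loss of the best pruning of $\mathcal{T}_u$. More precisely, applying the inequality
recursively, we obtain the following upper bound on the loss $\hat{L}_\epsilon(y)$ at the root:

$$\hat{L}_\epsilon(y) \le L_{\mathcal{P}}(y) + |\mathcal{P}| (\ln 2)/(\ln(1/\beta))$$

for any pruning $\mathcal{P} \in \mathrm{PRUN}(\mathcal{T})$. This is because every node of $\mathcal{P}$ that is not a leaf
of $\mathcal{T}$ gives an extra loss of $(\ln 2)/(\ln(1/\beta))$. Recall that $|\mathcal{P}|$ denotes the number of
such nodes. Since $r_t^u$ in APA$_u(\beta)$ is shown to be a $\beta$-mixture for any $u \in \mathrm{nodes}(\mathcal{T})$
and hence $\lambda(y_t, \Sigma_\beta(r_t^u)) \le c(\beta) r_t^u(y_t)$ for any $1 \le t \le T$ and $u \in \mathrm{nodes}(\mathcal{T})$, the
cumulative loss of the SWP, denoted $L_{SWP(\beta)}(y)$, is bounded as follows:

$$L_{SWP(\beta)}(y) = \sum_{t=1}^{T} \lambda(y_t, \Sigma_\beta(r_t^\epsilon))
  \le c(\beta) \sum_{t=1}^{T} r_t^\epsilon(y_t)
  = c(\beta) \hat{L}_\epsilon(y)
  \le c(\beta) \left( L_{\mathcal{P}}(y) + |\mathcal{P}| (\ln 2)/(\ln(1/\beta)) \right)$$

for any $\mathcal{P} \in \mathrm{PRUN}(\mathcal{T})$ and $y \in Y^*$. Thus we have the following theorem.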

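Theorem 4. ([7], cf. [6]) There exists a prediction algorithm $A$ such that for any
$\mathcal{T}$, $V$ and $y \in Y^*$, when $\mathcal{T}$ and $V$ are given as input, $A$ makes predictions for $y$
so that the loss is at most

$$L_A(y) \le \min_{\mathcal{P} \in \mathrm{PRUN}(\mathcal{T})}
  c(\beta) \left( L_{\mathcal{P},V}(y) + \frac{|\mathcal{P}| \ln 2}{\ln(1/\beta)} \right).$$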

We now push further the idea of putting mini-experts at each
node of a decision tree in order to make predictions. If we put at
each node mini-experts that decide not only whether to throw away the subtree below the
node, but also which value in the prediction space $\hat{Y}$ to predict, then we may
construct an algorithm that is competitive with the best pruning
having the best predictions at its leaves.

In fact, if $\hat{Y}$ is finite, this can be done by associating $|\hat{Y}| + 1$ experts with
each internal node $u$, one predicting the value that the subtree below $u$ predicts,
and the others predicting the different values in the prediction space $\hat{Y}$; we also need
to associate $|\hat{Y}|$ experts with each leaf, each predicting a different value in $\hat{Y}$.
We shall deal with an infinite prediction space $\hat{Y}$ and give an algorithm that is
competitive not only with the best pruning $\mathcal{P}$ but also with the best node labeling
$V$. To do so, we assume that the node label function $V$ is time invariant ($V$ does
not depend on $t$), and that the loss function $\lambda$ is taken to be the absolute loss
function. Since the loss function is assumed to be the absolute loss function, the
loss suffered at the node $l$ (which is not necessarily a leaf) of $\mathcal{T}$ under a label
function $V$, which is denoted by $L_{l,V}(y)$, is given by

$$L_{l,V}(y) = \sum_{t \,:\, x_t \text{ reaches } l} |y_t - V(l)|$$

for an outcome sequence $y = (y_1, \ldots, y_T) \in \{0, 1\}^T$. Then the loss of a pruning
$\mathcal{P}$ of $\mathcal{T}$ and a label function $V$ can be represented by

$$L_{\mathcal{P},V}(y) = \sum_{l \in \mathrm{leaves}(\mathcal{P})} L_{l,V}(y).$$
Let $l$ be a node of $\mathcal{T}$. For a label function $V$, let $V_0$ and $V_1$ denote the label
functions obtained by replacing the value $V(l)$ by 0 and 1, respectively (with
the values of the other nodes unchanged). Then it is easy to see that, for any
$y \in \{0, 1\}^*$,

$$\min\{L_{l,V_0}(y), L_{l,V_1}(y)\} \le L_{l,V}(y)$$

holds.

By applying the above inequality repeatedly, we have the next lemma, which
says that without loss of generality we can assume that the label function $V$
takes only binary values in $\{0, 1\}$.

Lemma 1. Let $\mathcal{T}$ be a template tree and let $V$ be a label function from $\mathrm{nodes}(\mathcal{T})$
to $[0, 1]$. Then, for any $y \in \{0, 1\}^*$ there exists a label function $V_B$ from $\mathrm{nodes}(\mathcal{T})$
to $\{0, 1\}$ such that

$$L_{\mathcal{T},V}(y) \ge L_{\mathcal{T},V_B}(y).$$

So we assume that $\hat{Y} = \{0, 1\}$. Then, since $\hat{Y}$ is finite, we can apply our strategy
for the decision tree with the weighted mini-experts placed at its nodes: two
mini-experts $\{\mathcal{E}_0, \mathcal{E}_1\}$ placed at each leaf in $\mathrm{leaves}(\mathcal{T})$ and three mini-experts
$\{\mathcal{E}_0, \mathcal{E}_1, \mathcal{E}_{u\downarrow}\}$ placed at each inner node $u$ in $\mathrm{nodes}(\mathcal{T})$, where $\mathcal{E}_0$ and $\mathcal{E}_1$ identically
predict 0 and 1, respectively, and $\mathcal{E}_{u\downarrow}$ predicts the value that the subtree below $u$
predicts. So it is clear that the modified algorithm achieves the same loss bounds
as in Theorem 4 except that $(\ln 2)|\mathcal{P}|$ in the loss bound is replaced by

$$(\ln 3)|\mathcal{P}| + (\ln 2)|\mathrm{leaves}(\mathcal{P}) \cap \mathrm{leaves}(\mathcal{T})|.$$

Thus, since the mixability curve for the absolute loss is given by $c(\beta) =
\ln(1/\beta)/(2 \ln(2/(1 + \beta)))$, we have established the following theorem in a similar
way to the proof of Theorem 4.

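Theorem 5. ([7], cf. [6]) In the case of the absolute loss, there exists a prediction
algorithm $A$ such that for any $\mathcal{T}$ and $y \in \{0, 1\}^*$, when given $\mathcal{T}$ as input, $A$ makes
predictions for $y$ so that the loss is at most

$$L_A(y) \le \min_{\mathcal{P} \in \mathrm{PRUN}(\mathcal{T})} \ \min_{V : \mathrm{nodes}(\mathcal{T}) \to [0,1]}
  \frac{L_{\mathcal{P},V}(y) \ln(1/\beta) + |\mathcal{P}| \ln 3 + |\mathrm{leaves}(\mathcal{T}) \cap \mathrm{leaves}(\mathcal{P})| \ln 2}{2 \ln(2/(1 + \beta))}.$$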

Finally we mention that, instead of searching for a good pruning with a
good label function, we can seek a good label function for the template tree
to obtain the same mistake bound. In this case we can greatly simplify the
algorithm by putting the mini-experts only on the leaves of the template tree.

Let pruning $\mathcal{P}_M$ of $\mathcal{T}$ and label function $V_M$ be those that minimize $L_{\mathcal{P},V}(y)$
for $y \in \{0, 1\}^T$. That is,

$$L_{\mathcal{P}_M,V_M}(y) = \min_{\mathcal{P} \in \mathrm{PRUN}(\mathcal{T})} \ \min_{V : \mathrm{nodes}(\mathcal{T}) \to [0,1]} L_{\mathcal{P},V}(y).$$

By Lemma 1 we can assume that $V_M$ is a function from $\mathrm{nodes}(\mathcal{P}_M)$ to $\{0, 1\}$.
On the other hand, it is easy to see that, if we define the label function $V^*$ by
$V^*(l') = V_M(l)$ for any leaf $l'$ of $\mathcal{T}$ that is a descendant of a leaf $l$ of $\mathcal{P}_M$, then we have

$$L_{\mathcal{T},V^*}(y) = L_{\mathcal{P}_M,V_M}(y).$$

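Since the loss of a decision tree $\mathcal{T}$ for a label function $V$ is represented as

$$L_{\mathcal{T},V}(y) = \sum_{l \in \mathrm{leaves}(\mathcal{T})} L_{l,V}(y)$$

and the loss $L_{l,V}(y)$ at leaf $l$ is independent of the labels $V(l')$ for leaves
$l' \ne l$, we have

$$\min_{V : \mathrm{leaves}(\mathcal{T}) \to \{0,1\}} L_{\mathcal{T},V}(y)
  = \sum_{l \in \mathrm{leaves}(\mathcal{T})} \min_{V(l) \in \{0,1\}} L_{l,V}(y)
  = L_{\mathcal{T},V^*}(y),$$

which says that, in order to find the best $V^*$, it suffices to find the best values
(0 or 1) at each leaf $l$ of $\mathcal{T}$ independently.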

In this way the problem of minimizing $L_{\mathcal{P},V}(y)$ over label function $V$ and
pruning $\mathcal{P}$ of $\mathcal{T}$ can be reduced to the problem of minimizing $L_{\mathcal{T},V}(y)$ over
label function $V$. We will give an algorithm that solves the latter minimization
problem. To solve the problem, we associate two mini-experts $\mathcal{E} = \{\mathcal{E}_0, \mathcal{E}_1\}$ with
each leaf $l$ of $\mathcal{T}$, the one, $\mathcal{E}_0$, identically predicting 0 and the other, $\mathcal{E}_1$,
identically predicting 1, and then place the aggregating algorithm AA$_l(\beta)$ at each
leaf $l$, which works for the instances that reach the leaf $l$.

Let a leaf $l$ of $\mathcal{T}$ be fixed. Set both initial weights of $\mathcal{E}_0$ and $\mathcal{E}_1$ to 1/2.
Since the losses that these experts suffer at each trial are 0 or 1, the weights of the
experts depend only on the number of times the mini-experts make mistakes.
More precisely, denoting the weights of $\mathcal{E}_0$ and $\mathcal{E}_1$ at the beginning of trial $t$ by
$w_0^t$ and $w_1^t$, respectively, these weights are given by



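$$w_0^t = \frac{\beta^{|K_1|}}{\beta^{|K_0|} + \beta^{|K_1|}}
  \quad \text{and} \quad
  w_1^t = \frac{\beta^{|K_0|}}{\beta^{|K_0|} + \beta^{|K_1|}},$$

where

$$K_0 = \{ t' < t : x_{t'} \text{ reaches } l,\ y_{t'} = 0 \}
  \quad \text{and} \quad
  K_1 = \{ t' < t : x_{t'} \text{ reaches } l,\ y_{t'} = 1 \}.$$

Then the weighted average $r$ of the predictions of the experts is given by

$$r(0) = \log_\beta \left( \beta^{|K_1|} + \beta^{|K_0|+1} \right),$$
$$r(1) = \log_\beta \left( \beta^{|K_1|+1} + \beta^{|K_0|} \right),$$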
where the re-normalization is skipped. Since the prediction output by the AA$_l$
only depends on the difference

$$r(0) - r(1) = \log_\beta \frac{\beta^{|K_1|} + \beta^{|K_0|+1}}{\beta^{|K_1|+1} + \beta^{|K_0|}}
  = \log_\beta \frac{\beta^{|K_1|-|K_0|} + \beta}{\beta^{|K_1|-|K_0|+1} + 1},$$

the algorithm AA$_l$ works well by only maintaining an integer $a = |K_1| - |K_0|$
rather than the weights. Theorem 2 says that the loss of AA$_l$ for $y$ is bounded by

$$L_{AA_l}(y) \le \min_{V : \{l\} \to \{0,1\}}
  \frac{L_{l,V}(y) \ln(1/\beta) + \ln 2}{2 \ln(2/(1 + \beta))}.$$

The complete algorithm, denoted $A^*$, that is competitive with the best
pruning associated with the best label function is given as follows.

Algorithm 5 (prediction algorithm $A^*$)
  $c := (\ln(1/\beta))/(2 \ln(2/(1 + \beta)))$
  for $l \in \mathrm{leaves}(\mathcal{T})$ do
    $a_l := 0$
  for $t := 1, 2, \ldots$ do
    receive $x_t$
    $l := \mathrm{leaf}_{\mathcal{T}}(x_t)$
    output $\hat{y}_t$
    receive $y_t$
    if $y_t = 1$ then
      $a_l := a_l + 1$
    else
      $a_l := a_l - 1$
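A compact Python sketch of this counter-based scheme is given below. It assumes that the prediction is obtained from the difference $r(0) - r(1)$ derived above by clipping $1/2 + c\,(r(0) - r(1))/2$ to [0, 1]; that rule is an assumption, chosen because it satisfies the substitution inequality for the absolute loss. The helper `leaf_of`, which maps an instance to an identifier of the leaf (or block of a partition) it reaches, is hypothetical.

```python
import math

def predict_from_counter(a, beta):
    """Prediction of the per-leaf AA from the counter a = |K_1| - |K_0|."""
    c = math.log(1 / beta) / (2 * math.log(2 / (1 + beta)))
    # r(0) - r(1) = log_beta((beta^a + beta) / (beta^(a+1) + 1))
    diff = math.log((beta ** a + beta) / (beta ** (a + 1) + 1), beta)
    return min(1.0, max(0.0, 0.5 + c * diff / 2))

def a_star(leaf_of, instances, outcomes, beta):
    """Cumulative absolute loss of the counter-based predictor (sketch)."""
    counters = {}   # one integer per leaf, created lazily with value 0
    total = 0.0
    for x, y in zip(instances, outcomes):
        l = leaf_of(x)
        a = counters.get(l, 0)
        total += abs(y - predict_from_counter(a, beta))
        counters[l] = a + 1 if y == 1 else a - 1
    return total
```

Because only one integer per leaf is stored, the state is far smaller than in the weighted scheme. For very long runs with a strongly negative counter, `beta ** a` can overflow a float, so a production version would work with the exponent directly.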
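It is clear that

$$L_{A^*}(y) = \sum_{l \in \mathrm{leaves}(\mathcal{T})} L_{AA_l}(y)
  \le \sum_{l \in \mathrm{leaves}(\mathcal{T})} \min_{V : \{l\} \to \{0,1\}}
      \frac{L_{l,V}(y) \ln(1/\beta) + \ln 2}{2 \ln(2/(1 + \beta))}
  = \min_{V : \mathrm{leaves}(\mathcal{T}) \to \{0,1\}}
      \frac{L_{\mathcal{T},V}(y) \ln(1/\beta) + |\mathrm{leaves}(\mathcal{T})| \ln 2}{2 \ln(2/(1 + \beta))}.$$

Thus we have the following theorem.

Theorem 6. Let the absolute loss function be assumed. For any $\mathcal{T}$ and for any
$y \in \{0, 1\}^*$,

$$L_{A^*}(y) \le \min_{\mathcal{P} \in \mathrm{PRUN}(\mathcal{T})} \ \min_{V : \mathrm{nodes}(\mathcal{T}) \to [0,1]}
  \frac{L_{\mathcal{P},V}(y) \ln(1/\beta) + |\mathrm{leaves}(\mathcal{T})| \ln 2}{2 \ln(2/(1 + \beta))}.$$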
Having constructed a prediction algorithm that is competitive with the
best pruning as well as the best node labeling, we can see that the tree $\mathcal{T}$
can be thought of as specifying how to partition the instance space $\Sigma^*$ into
subclasses and how to assign a prediction value to each subclass. Because our
algorithm does not use the internal structure of $\mathcal{T}$ but only uses the subclasses
defined by $\mathcal{T}$, it can easily be generalized to any given rule that partitions the
instance space into subclasses.
Abstract. In this paper, we study exact learning of logic programs
from entailment and present a polynomial time algorithm to learn a rich 
class of logic programs that allow local variables and include many stan- 
dard programs like append , merge, split, delete, member, prefix, 
suffix, length, reverse, append/4 on lists, tree traversal programs 
on binary trees and addition, multiplication, exponentiation on 
natural numbers. Grafting a few aspects of incremental learning 0 onto 
the framework of learning from entailment [3], we generalize the existing
results to allow local variables, which play an important role of sideways 
information passing in the paradigm of logic programming. 



1 Introduction 

Starting with the seminal work of Shapiro HMD, the problem of learning logic 
programs from examples and queries has attracted a lot of attention in the 
last fifteen years. Many techniques and systems for learning logic programs are 
developed and used in many applications. See El for a survey. In this paper, 
we consider the framework of learning from entailment [1-7,13,14] and present 
a polynomial time algorithm to learn a rich class of logic programs that allow 
local variables and include many standard programs from Sterling and Shapiro’s 
book [T^. 

Our work has been inspired by the recent work of Arimura PI presenting a 
polynomial time algorithm to learn a class of logic programs called acyclic con- 
strained Horn programs. This class includes an impressive set of standard pro- 
grams with recursion like append , merge, split, delete, member, prefix, 
suffix, length and add besides many nonrecursive programs. The main pro- 
perty of these programs is that all the terms in the body of a clause are subterms 
of the terms in the head. This means that local variables are not allowed. Ho- 
wever, local variables play an important role of sideways information passing in 
the paradigm of logic programming and there is an urgent need to extend the 
results for classes of programs which allow local variables. 



M.M. Richter et al. (Eds.): ALT’98, LNAI 1501, pp. 143-^^] 1998. 
(c) Springer- Verlag Berlin Heidelberg 1998 
In this paper, we extend the results of Arimura 0 for one such class of
programs, using moding annotations and background knowledge. Our background
knowledge is nothing but a logic program already learnt, perhaps using the fra- 
mework of learning from entailment itself. In other words, we graft a few aspects 
of incremental learning 0 to the framework of learning from entailment P| . To 
summarize the results of this paper, (1) a class of logic programs as background
knowledge is identified together with (2) a class of logic programs (called finely- 
moded programs) learnable in polynomial time from entailment is introduced, 
(3) some results about the complexity of subsumption and entailment problem 
for these classes are obtained and (4) a learning algorithm is presented. We also 
prove that the class of finely-moded programs properly contains the class of 
acyclic constrained Horn programs. 

The rest of the paper is organized as follows. The next section gives preli- 
minary definitions and section 3 defines the class of finely-moded programs and 
proves some characteristic properties of them. Section 4 presents a few results 
about subsumption and entailment and section 5 presents the learning algorithm 
for finely-moded programs. Section 6 provides correctness proof of the learning 
algorithm and section 7 concludes with a discussion. 



2 Preliminaries 

Assuming that the reader is familiar with the basic terminology of first order
logic and logic programming m, we use the first order logic language with a
finite set $\Pi$ of predicate symbols and a finite set $\Sigma$ of function symbols. The
arity of a predicate/function symbol $f$ is denoted by $arity(f)$. Function symbols
of arity zero are also called constants. The size of a term/atom/clause/program
is defined as the number of (occurrences of) variables, predicate and function
symbols in it.

Definition 1 A mode $m$ of an $n$-ary predicate $p$ is a function from $\{1, \ldots, n\}$ to
the set $\{in, out\}$. The sets $in(p) = \{j \mid m(j) = in\}$ and $out(p) = \{j \mid m(j) = out\}$
are the sets of input and output positions of $p$ respectively.

A moded program is a logic program with each predicate having a unique
mode associated with it. In the following, $p(s; t)$ denotes an atom with input
terms $s$ and output terms $t$. The set of variables occurring in $t$ is denoted by
$Var(t)$.

Definition 2 A definite clause

$$p_0(s_0; t_0) \leftarrow p_1(s_1; t_1), \ldots, p_k(s_k; t_k)$$

$k \ge 0$ is well-moded if (a) $Var(t_0) \subseteq Var(s_0, t_1, \ldots, t_k)$ and (b) $Var(s_i) \subseteq
Var(s_0, t_1, \ldots, t_{i-1})$ for each $i \in [1, k]$. A logic program is well-moded if each
clause in it is well-moded.
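Conditions (a) and (b) of Definition 2 amount to a single left-to-right pass over the body. The Python sketch below illustrates this under an assumed, simplified encoding: each atom is a pair (input variables, output variables), with terms already reduced to the sets of variables occurring in them. The function name is hypothetical.

```python
def is_well_moded(head, body):
    """Check conditions (a) and (b) of the well-modedness definition.

    head is a pair (Var(s_0), Var(t_0)); body is a list of pairs
    (Var(s_i), Var(t_i)) in the order the atoms appear in the clause.
    """
    s0, t0 = head
    known = set(s0)                 # variables bound so far: Var(s_0, t_1, ..., t_{i-1})
    for s_i, t_i in body:
        if not set(s_i) <= known:   # condition (b) fails for this body atom
            return False
        known |= set(t_i)           # outputs of atom i become available afterwards
    return set(t0) <= known         # condition (a)
```

For instance, the clause m(s(X), Y; Z) ← m(X, Y; Z1), a(Y, Z1; Z) passes the check, while swapping its two body atoms fails condition (b), because Z1 would be consumed before any atom produces it.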
The class of well-moded programs is extensively studied in the literature and
the following lemma is one of the well-known facts about well-moded programs.

Lemma 1 Let $P$ be a well-moded program and $Q$ be the query $\leftarrow p(s; t)$ with
ground input terms $s$. If there is an SLD-refutation of $P \cup \{Q\}$ with $\theta$ as computed
answer substitution then $t\theta$ is ground as well.

Definition 3 A predicate $p$ defined in a well-moded program $P$ is deterministic
if $t_1 = t_2$ whenever $P \models p(s; t_1)$ and $P \models p(s; t_2)$ for any sequence of ground
input terms $s$.

In this paper, we only consider deterministic well-moded programs.
Without loss of generality, we assume that each predicate has at most one output
position.^1



3 Finely-Moded Programs 

As mentioned in the introduction, our learning algorithm takes a logic program
as background knowledge. In this section, we present our assumptions about the
background knowledge and the class of finely-moded programs.

Definition 4 Let program $B$ be a background knowledge and $t$ be a ground
term. The dependent set $D_B(t)$ of $t$ w.r.t. $B$ is defined as

1. $t \in D_B(t)$,

2. if $u \in D_B(t)$ and $B \models p(s; u)$ for some predicate $p$ in $B$ and ground input
terms $s$ then every term in $s$ is in $D_B(t)$ and

3. if $u \in D_B(t)$ then every subterm of $u$ is in $D_B(t)$.

The following lemma is useful in the sequel.

Lemma 2 Let $t$ be a ground term. Then $D_B(s) \subseteq D_B(t)$ if $s$ is a subterm of $t$
or $s \in D_B(t)$.

Proof : Easy. □

^1 A predicate symbol with $k > 1$ output positions can be replaced by a predicate
symbol with 1 output position (and the same number of input positions) using a $k$-tupling
operator. An atom $p(s_1, \ldots, s_n; t_1, \ldots, t_k)$ with $k$ output positions will be replaced
by the corresponding atom $p'(s_1, \ldots, s_n; f(t_1, \ldots, t_k))$ with 1 output position, where
$f$ is a fresh function symbol of arity $k$.
Definition 5 A background knowledge $B$ is regular if (a) for every ground term
$t$, $|D_B(t)|$ is bounded by a polynomial in the size of $t$, and (b) the size of the
SLD-tree for any query $\leftarrow A$ is bounded by a polynomial in the size of $A$.

Example 1 Consider the following append program,
moding: app(in,in,out).

app([ ],Ys,Ys) ←
app([X|Xs],Ys,[X|Zs]) ← app(Xs,Ys,Zs)

This program is well-moded and deterministic. For a list $L$, $D(L)$ is the set of
sublists of $L$. The number of sublists^2 of a list $L$ of length $n$ is ${}^{(n+1)}C_2 + 1$,
which is of the order $O(n^2)$. It is clear that the size of the SLD-tree for any query
$\leftarrow A$ is bounded by a polynomial (in fact, linear) in the size of $A$. Therefore, $B$
is regular. □
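The count in Example 1 can be checked mechanically. The Python sketch below computes the list-valued part of the dependent set for the append background knowledge, under an assumed encoding in which a ground list is a tuple of distinct atoms. Rule 2 of Definition 4 is applied through every split u = u[:k] + u[k:], since app(u[:k], u[k:]; u) holds; rule 3 contributes the tails, which the same splits already cover. List elements, which are also subterms, are deliberately left out so that the result matches the "set of sublists" reading of the example.

```python
def dependent_lists(t):
    """Lists in the dependent set of the ground list t w.r.t. append (sketch)."""
    start = tuple(t)
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        candidates = set()
        for k in range(len(u) + 1):
            candidates.add(u[:k])   # first input of app(u[:k], u[k:]; u)
            candidates.add(u[k:])   # second input; also covers the tails of u
        for v in candidates - seen:
            seen.add(v)
            stack.append(v)
    return seen
```

With n distinct elements, the closure contains exactly the contiguous sublists, including the empty list, that is n(n+1)/2 + 1 lists, in agreement with the quadratic bound stated in the example.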



Example 2 It is easy to verify that standard programs for multiplication
and addition can serve as regular background knowledge. □

In this paper, we only deal with regular background knowledge and use B to 
denote the background knowledge under consideration. 

Now, we present the class of finely-moded programs. We partition
$\Pi$ into $\Pi_0$ and $\Pi_1$ such that $\Pi_0$ contains the predicate symbols (say, $k_0$) defined
in $B$ and $\Pi_1$ contains the rest of the predicate symbols (say, $k_1$). We assume
that the maximum arity of a predicate symbol in $B$ is $k_2$.

Definition 6 A well-moded clause

$$p_0(s_0; t_0) \leftarrow p_1(s_1; t_1), \ldots, p_n(s_n; t_n)$$

$n \ge 0$ is finely-moded if there is an integer $m \in [1, n]$ such that

1. predicate symbols $p_1, \ldots, p_m$ are in $\Pi_1$ and $p_{m+1}, \ldots, p_n$ are in $\Pi_0$,

2. for each $i \in [1, m]$, every term in $s_i$ is a subterm of a term in $s_0$ and $t_i$ is
either a subterm of $t_0$ or a (local) variable occurring in $s_{m+1}, \ldots, s_n$ and

3. for each $i \in [m + 1, n]$, $t_i$ is a subterm of $t_0$ and every term in $s_i$ is either a
subterm of a term in $s_0$ or a (local) variable in $t_1, \ldots, t_m$.

Definition 7 A well-moded program $P$ is finely-moded if each clause in it is
finely-moded.

^2 Basically, a non-empty sublist of $L$ can be identified by its two end-points. The
number of possible ways of choosing two distinct points on a line with $n + 1$ points
is ${}^{(n+1)}C_2$. Therefore, the number of sublists of a list $L$ of length $n$ is ${}^{(n+1)}C_2 + 1$.


Learning from Entailment of Logic Programs 147 



Example 3 The following program for multiplication is finely-moded w.r.t. the
regular background knowledge about addition.

moding: a(in,in,out) and m(in,in,out).
a(0,Y,Y) ←
a(s(X),Y,s(Z)) ← a(X,Y,Z)
m(0,Y,0) ←
m(s(X),Y,Z) ← m(X,Y,Z1), a(Y,Z1,Z)

Example 4 The following program for reverse is finely-moded w.r.t. the regular
background knowledge about append-last.

moding: app-last(in,in,out) and rev(in,out).
app-last([ ],Y,[Y]) ←
app-last([X|Xs],Y,[X|Zs]) ← app-last(Xs,Y,Zs)
rev([ ],[ ]) ←
rev([X|Xs],Zs) ← rev(Xs,Ys), app-last(Ys,X,Zs)



We present two characteristic theorems about finely-moded programs below.
In view of the background knowledge, we adapt SLD-computations as follows.

Definition 8 Let $B$ be a regular background knowledge, $P$ be a finely-moded
program and $Q$ be a query $\leftarrow p(s; t)$ with ground input terms $s$. An adapted
SLD-derivation of $P \cup \{Q\}$ is a sequence of queries $Q_0 = Q, Q_1, Q_2, \ldots$ such that
each $Q_i$, $i > 0$ satisfies one of the following:

1. $Q_{i-1}$ is $\leftarrow A_1, \ldots, A_n$, the predicate symbol of the selected atom $A_1$ is in
$\Pi_1$, the head $H$ of a clause $H \leftarrow B_1, \ldots, B_m$ in $P$ unifies with $A_1$ through
a most general unifier $\sigma$ and $Q_i$ is

$\leftarrow B_1\sigma, \ldots, B_m\sigma, A_2\sigma, \ldots, A_n\sigma$.

2. $Q_{i-1}$ is $\leftarrow A_1, \ldots, A_n$, the predicate symbol of the selected atom $A_1$ is in
$\Pi_0$, $B \models A_1\sigma$ and $Q_i$ is

$\leftarrow A_2\sigma, \ldots, A_n\sigma$.

An adapted SLD-derivation $Q_0, \ldots, Q_n$ is called an adapted SLD-refutation if
$Q_n$ is an empty query. The notion of an adapted SLD-tree is defined similarly.

The following two theorems are characteristic facts about finely-moded pro- 
grams. 
Theorem 1 Let $B$ be a regular background knowledge, $P$ be a finely-moded
program and $Q$ be a ground query $\leftarrow p(s; t)$ with predicate $p \in \Pi_1$. Then every
input term of any atom $q(u; v)$ in an adapted SLD-derivation of $P \cup \{Q\}$ is a
subterm of a term in $s$ if $q \in \Pi_1$.

Proof : Induction on the length $l$ of the adapted SLD-derivation. Use the fact
that input terms of an atom with predicate in $\Pi_1$ in the body of a finely-moded
clause are subterms of the input terms of its head. □

Theorem 2 Let $B$ be a regular background knowledge, $P$ be a finely-moded
program and $Q$ be a ground query $\leftarrow p(s; t)$ with predicate $p \in \Pi_1$. If $q(u; v)$ is
an atom in any adapted SLD-refutation of $P \cup \{Q\}$ with answer substitution $\theta$
and $q \in \Pi_0$ then $v\theta \in D_B(t)$.

Proof : Induction on the length $l$ of the adapted SLD-refutation.

Basis : $l = 1$. There is nothing to prove in this case.

Induction Hypothesis : Assume that the theorem holds for all SLD-refutations
of length $l < k$.

Induction Step : Now, we establish that it holds for $l = k$. Let $p_0(s_0; t_0) \leftarrow
p_1(s_1; t_1), \ldots, p_n(s_n; t_n)$ be the input clause used in the first resolution step.
There are two cases: (1) all predicate symbols $p_1, \ldots, p_n$ are in $\Pi_1$ or (2) there
is an $m < n$ such that $p_1, \ldots, p_m$ are in $\Pi_1$ and $p_{m+1}, \ldots, p_n$ are in $\Pi_0$.

Case (1): By the definition of finely-moded clauses, each term in $s_1, \ldots, s_n$ is a
subterm of a term in $s_0$ and each $t_i$ is a subterm of $t_0$. Since $Q$ is a ground query,
$s_0\theta = s$ and $t_0\theta = t$ and hence each atom $p_i(s_i; t_i)\theta$ is ground. It is easy to see
that each atom in the adapted SLD-refutation of $P \cup \{Q\}$ is also an atom in an
adapted SLD-refutation of $P \cup \{\leftarrow p_i(s_i; t_i)\theta\}$ for some $i \in [1, n]$. The length of
the adapted SLD-refutation of $P \cup \{\leftarrow p_i(s_i; t_i)\theta\}$ is clearly less than $k$ and by
the induction hypothesis, $v\theta \in D_B(t_i\theta) \subseteq D_B(t)$ for each atom $q(u; v)$ in any
adapted SLD-refutation of $P \cup \{\leftarrow p_i(s_i; t_i)\theta\}$ if $q \in \Pi_0$.

Case (2): We have 2 subcases: $m = 0$ and $m > 0$. In the former subcase, there
are no local variables and $t_1, \ldots, t_n$ are subterms of $t_0$. Further, $p_1(s_1; t_1)\theta, \ldots,
p_n(s_n; t_n)\theta$ are the only atoms in the adapted SLD-refutation of $P \cup \{Q\}$. Since
$Q$ is a ground query, $s_0\theta = s$ and $t_0\theta = t$. Hence, output terms of these atoms
are subterms of $t$ and therefore members of $D_B(t)$.

Now, consider the subcase $m > 0$. By the definition of finely-moded clauses,
$t_{m+1}, \ldots, t_n$ are subterms of $t_0$ and hence $t_{m+1}\theta, \ldots, t_n\theta$ are subterms of $t_0\theta =
t$ and therefore members of $D_B(t)$. For each atom $q(u; v)$ (not a member of
$p_{m+1}(s_{m+1}; t_{m+1})\theta, \ldots, p_n(s_n; t_n)\theta$) in the adapted SLD-refutation of $P \cup \{Q\}$
with $q \in \Pi_0$, it is clear that $q(u; v)\theta$ is an atom in an adapted SLD-refutation
of $P \cup \{\leftarrow p_i(s_i; t_i)\theta\}$, for some $i \in [1, m]$. The length of the adapted
SLD-refutation of $P \cup \{\leftarrow p_i(s_i; t_i)\theta\}$ is clearly less than $k$ and by the induction
hypothesis, $v\theta \in D_B(t_i\theta)$. Now, we prove that $D_B(t_i\theta) \subseteq D_B(t)$ for each $i \in
[1, m]$. By the definition of finely-moded clauses, $t_1, \ldots, t_m$ are subterms of terms
in $t_0, s_{m+1}, \ldots, s_n$. If $t_i$ is a subterm of $t_0$ then $D_B(t_i\theta) \subseteq D_B(t_0\theta) = D_B(t)$.
Consider the case that $t_i$ is a subterm of a term in $s_{m+j}$ for some $j \in [1, n - m]$.
Since $B \models p_{m+j}(s_{m+j}; t_{m+j})\theta$, it follows that each term in $s_{m+j}\theta$ is a member of
$D_B(t_{m+j}\theta) \subseteq D_B(t_0\theta) = D_B(t)$. Therefore, $D_B(t_i\theta) \subseteq D_B(t)$ as $t_i$ is a subterm
of a term in $s_{m+j}$. □

It may be noted that, unlike Theorem 1, this theorem does not hold for
an arbitrary adapted SLD-derivation, but holds only for SLD-refutations. In
particular, it does not hold if $P \not\models p(s; t)\theta$.



4 Subsumption and Entailment 

Definition 9 Let Ci and C 2 be clauses Hi ^ Bodyi and H 2 ^ Body 2 respec- 
tively. We say Ci subsumes C 2 and write Ci ^ C 2 if there exists a substitution 
9 such that H\9 = H 2 and Body\9 C Body 2 - 

Definition 10 A program $P_1$ is a refinement of program $P_2$, denoted by
$P_1 \sqsubseteq P_2$, if $(\forall C_1 \in P_1)(\exists C_2 \in P_2)\, C_2 \succeq C_1$.
Further, $P_1$ is a conservative refinement of $P_2$ if $P_1$ is a refinement of
$P_2$ and each $C$ in $P_2$ has at most one $C' \in P_1$ such that $C \succeq C'$.

Definition 11 A program $P$ entails a clause $C$, denoted by $P \models C$, if $C$
is a logical consequence of $P$.

The relation between subsumption and entailment is discussed below. 

Definition 12 A derivation of a clause $C$ from a program $P$ is a finite sequence
of clauses $C_1, \ldots, C_k = C$ such that each $C_i$ is either an instance of a
clause in $P$ or a resolvent of two clauses in $C_1, \ldots, C_{i-1}$. If such a
derivation exists, we write $P \vdash_d C$.

The following theorem is proved in Nienhuys-Cheng and de Wolf [12].

Theorem 3 (Subsumption Theorem)

Let $P$ be a program and $C$ be a clause. Then $P \models C$ if and only if one of
the following holds:

(1) $C$ is a tautology or

(2) there exists a clause $D$ such that $P \vdash_d D$ and $D$ subsumes $C$.

When C is ground, the above theorem can be reformulated as follows. 

Theorem 4 Let $P$ be a program and $C$ be a ground clause
$A \leftarrow B_1, \ldots, B_n$. Then $P \models C$ if and only if one of the
following holds.

(1) $C$ is a tautology.

(2) $C$ is subsumed by a clause in $P$.

(3) There is a minimal SLD-refutation of $P' \cup \{\leftarrow A\}$, where
$P' = P \cup \{B_i \leftarrow \mid i \in [1, n]\}$.

Definition 13 An SLD-refutation is minimal if selected atoms are resolved
with unit clauses whenever possible.

Even though (2) is covered by (3) in the above theorem, we explicitly mention 
(2) in view of its importance in our learning algorithm. 

Lemma 3 If $C_1$ and $C_2$ are two finely-moded clauses, $C_1 \succeq C_2$ is
decidable in polynomial time over the sizes of $C_1$ and $C_2$.

5 Learning Algorithm 

In this section, we present an algorithm Learn-FM for exact learning of
terminating finely-moded programs from entailment using equivalence, subsumption
and request-for-hint queries. The oracle (teacher) answers ‘yes’ to an
entailment equivalence query $EQUIV(H)$ if $H$ is equivalent to the target program
$H^*$, i.e., $H \models H^*$ and $H^* \models H$. Otherwise, it produces a ground
atom $A$ such that $H^* \models A$ but $H \not\models A$, or $H^* \not\models A$ but
$H \models A$. A subsumption query $SUBSUME(C)$ produces an answer ‘yes’ if the
clause $C$ is subsumed by a clause in $H^*$, otherwise answer ‘no’. When $C$ is a
ground clause $A \leftarrow B_1, \ldots, B_n$ such that $H^* \models C$, the
request-for-hint query $REQ(C)$ returns (1) an answer ‘subsumed’ if $C$ is
subsumed by a clause in $H^*$, otherwise returns (2) an atom (hint) $B\theta$ in a
minimal adapted SLD-refutation of $H' \cup \{\leftarrow A\}$ with answer
substitution $\theta$ such that $B\theta \notin \{B_1, \ldots, B_n\}$, where
$H' = H^* \cup \{B_i \leftarrow \mid i \in [1, n]\}$.

Algorithm Learn-FM uses the notions of saturation and least general
generalization.

Definition 14 A clause $C$ is a saturation of an example $E$ w.r.t. a theory
(program) $H$ if and only if $C$ is a reformulation of $E$ w.r.t. $H$ and
$C' \succeq C$ for every reformulation $C'$ of $E$ w.r.t. $H$. A clause $D$ is a
reformulation of $E$ w.r.t. $H$ if and only if $H \wedge E \Leftrightarrow H \wedge D$.

We are concerned with finely-moded programs and clauses and define the
saturation of an example $E = p_0(s_0; t_0)$ w.r.t. $H$ as
$E \leftarrow Closure_H(E)$, where $Closure_H(E) = S_1 \cup S_2$ such that $S_1$ is
the set of ground atoms $\{p(s; t) \mid p \in \Pi_1$, each term in $s$ is a subterm
of a term in $s_0$ and $H \models p(s; t)\}$ and $S_2$ is the set of ground atoms
$\{q(u; v) \mid q \in \Pi_0$, $v$ is a subterm of $t_0$, $H \models q(u; v)$ and each
term in $u$ is either a subterm of a term in $s_0$ or an output term of an atom in
$S_1\}$.

Definition 15 Let $C_1$ and $C_2$ be two finely-moded clauses
$A_1 \leftarrow Body_1$ and $A_2 \leftarrow Body_2$ respectively. The least general
generalization $C_1 \sqcup C_2$ of $C_1$ and $C_2$ is defined as a finely-moded
clause $A \leftarrow Body$ such that (1) $A = p_0(s_0; t_0)$ is the least general
generalization of $A_1$ and $A_2$ and $A_i = A\sigma_i$, $i \in [1, 2]$, (2)
$Body = S_1 \cup S_2$ is the largest set of atoms such that (a)
$S_1 = \{p(s; t) \mid p \in \Pi_1$, $p(s; t)\sigma_i \in Body_i$, $i \in [1, 2]$, each
term in $s$ is a subterm of a term in $s_0$ and $t$ is either a subterm of $t_0$
or a local variable$\}$ and (b) $S_2 = \{p(s; t) \mid p \in \Pi_0$,
$p(s; t)\sigma_i \in Body_i$, $i \in [1, 2]$, $t$ is a subterm of $t_0$ and each term
in $s$ is either a subterm of a term in $s_0$ or an output variable (local) of an
atom in $S_1\}$.
Now, we are in a position to present our algorithm Learn-FM.



Procedure Learn-FM;
begin $H := B$;
  while $EQUIV(H) \neq$ ‘yes’ do
  begin $A := EQUIV(H)$;
    $C := A \leftarrow Closure_H(A)$;
    while $REQ(C)$ returns a hint $B$ do $C := B \leftarrow Closure_H(B)$;
    % This while loop exits when C is subsumed by a clause in H*. %
    $C := Reduce(C)$;
    if $SUBSUME(C \sqcup D)$ returns ‘yes’ for some clause $D \in H$ then
      generalize $H$ by replacing $D$ with $Reduce(C \sqcup D)$
    else generalize $H$ by adding $C$ to $H$
  end;
  Return($H$)
end Learn-FM;

Function Reduce($A \leftarrow Body$);
% Removes irrelevant literals in the body of a clause. %
begin
  for each atom $B \in Body$ do
    if $SUBSUME(A \leftarrow (Body - \{B\}))$ then $Body := (Body - \{B\})$;
  Return($A \leftarrow Body$)
end Reduce;
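A minimal sketch of the function Reduce follows, with the subsumption oracle passed in as a callable. The name `reduce_clause`, the oracle signature and the toy propositional oracle are assumptions for illustration; the toy oracle stands in for the teacher and is not how $SUBSUME$ would be answered for first-order clauses.

```python
def reduce_clause(head, body, subsume_query):
    """Drop each body atom whose removal keeps the clause subsumed by the target.
    subsume_query(head, body) plays the role of the SUBSUME oracle."""
    body = list(body)
    for atom in list(body):          # iterate over a snapshot while body shrinks
        candidate = [b for b in body if b != atom]
        if subsume_query(head, candidate):
            body = candidate
    return head, body

# Toy oracle (assumption): the target holds the single propositional clause
# h <- p, q. A clause is subsumed by it if the heads agree and the target
# body is contained in the clause body.
TARGET_HEAD, TARGET_BODY = "h", {"p", "q"}

def toy_subsume(head, body):
    return head == TARGET_HEAD and TARGET_BODY <= set(body)

reduced = reduce_clause("h", ["p", "r", "q", "s"], toy_subsume)
```

With this oracle the irrelevant atoms r and s are removed and the clause h <- p, q remains. The function issues one subsumption query per body atom, so the number of queries is linear in the size of the body.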

Remark: It may be noted that the application of the above function Reduce (from
Reddy and Tadepalli [14]) is not mandatory for the correctness of the algorithm
Learn-FM, but it improves the efficiency. In particular, checking subsumption
of reduced clauses is easier than that of non-reduced clauses.



Lemma 4 If a clause $C$ is subsumed by a clause in the target program $H^*$ then
$Reduce(C) = C'\theta$ for some clause $C'$ in $H^*$ and a substitution $\theta$.



Example 5 We illustrate the working of Learn-FM by considering the stan- 
dard multiplication program given in Example El The program for addition 
is given as the background knowledge $B$. For presentation purposes, we
consider counterexamples of small size. Learn-FM starts with $H = B$ as the
initial hypothesis and query $EQUIV(H)$ returns a counterexample, say
$A$ = m(s(s(0)), s(s(0)), s(s(s(s(0))))).

The inner while loop asks $REQ(A \leftarrow Body)$, where $Body$ = a(0,0,0),
a(0,s(0),s(0)), a(s(0),0,s(0)), a(0,s(s(0)),s(s(0))), a(s(s(0)),0,s(s(0))),
a(s(0),s(0),s(s(0))), a(s(0),s(s(0)),s(s(s(0)))), a(s(s(0)),s(0),s(s(s(0)))),
a(s(s(0)),s(s(0)),s(s(s(s(0))))).
This results in a hint m(0, s(s(0)), 0). Now, the inner while loop asks
$REQ$(m(0, s(s(0)), 0) $\leftarrow$ a(0,0,0)), which returns answer ‘subsumed’.

The function Reduce is applied to the clause m(0, s(s(0)), 0) $\leftarrow$ a(0,0,0)
and the resulting clause $C_0$: m(0, s(s(0)), 0) $\leftarrow$ is added to $H$.

The outer while loop asks $EQUIV(H)$ and gets a counterexample, say
$A_1$ = m(s(0), s(s(0)), s(s(0))). The inner while loop asks
$REQ(A_1 \leftarrow Body_1)$, where $Body_1$ = m(0,s(s(0)),0), a(0,0,0),
a(0,s(0),s(0)), a(s(0),0,s(0)), a(0,s(s(0)),s(s(0))), a(s(s(0)),0,s(s(0))),
a(s(0),s(0),s(s(0))). The query $REQ(A_1 \leftarrow Body_1)$ returns answer
‘subsumed’ and $Reduce(A_1 \leftarrow Body_1)$ is the clause
$C_1$: m(s(0), s(s(0)), s(s(0))) $\leftarrow$ m(0, s(s(0)), 0), a(s(s(0)), 0, s(s(0))).
The clause $C_1$ is added to $H$ as $SUBSUME(C_0 \sqcup C_1)$ returns ‘no.’

The outer while loop asks $EQUIV(H)$ and gets a counterexample, say
$A_2$ = m(s(s(0)), s(s(0)), s(s(s(s(0))))). The inner while loop asks
$REQ(A_2 \leftarrow Body_2)$, where $Body_2$ = m(0,s(s(0)),0),
m(s(0),s(s(0)),s(s(0))), a(0,0,0), a(0,s(0),s(0)), a(0,s(s(0)),s(s(0))),
a(s(s(0)),0,s(s(0))), a(s(0),0,s(0)), a(s(0),s(0),s(s(0))),
a(s(0),s(s(0)),s(s(s(0)))), a(s(s(0)),s(0),s(s(s(0)))),
a(s(s(0)),s(s(0)),s(s(s(s(0))))).

The query $REQ(A_2 \leftarrow Body_2)$ returns answer ‘subsumed’ and
$Reduce(A_2 \leftarrow Body_2)$ is the clause
m(s(s(0)), s(s(0)), s(s(s(s(0))))) $\leftarrow$ m(s(0), s(s(0)), s(s(0))),
a(s(s(0)), s(s(0)), s(s(s(s(0))))). The lgg of this clause and $C_1$ is
$C_3$: m(s(X), s(s(0)), s(s(Z))) $\leftarrow$ m(X, s(s(0)), Z1), a(s(s(0)), Z1, s(s(Z))).
The query $SUBSUME(C_3)$ returns answer ‘yes’ and clause $C_1$ in $H$ is replaced
by $C_3$.

The outer while loop asks $EQUIV(H)$ and gets a counterexample, say
m(s(0), s(0), s(0)). The inner while loop asks
$REQ$(m(s(0), s(0), s(0)) $\leftarrow$ a(0,0,0), a(0,s(0),s(0)), a(s(0),0,s(0))),
which returns a hint m(0, s(0), 0). Then the inner while loop asks
$REQ$(m(0, s(0), 0) $\leftarrow$ a(0,0,0)), which returns answer ‘subsumed’ and
$Reduce$(m(0, s(0), 0) $\leftarrow$ a(0,0,0)) is m(0, s(0), 0) $\leftarrow$. The lgg of
this clause and $C_0$ is $C_4$: m(0, Y, 0) $\leftarrow$. As $SUBSUME(C_4)$ returns
answer ‘yes,’ clause $C_0$ in $H$ is replaced by $C_4$.

The outer while loop asks $EQUIV(H)$ and (say) gets the above counterexample
m(s(0), s(0), s(0)). The inner while loop asks
$REQ$(m(s(0), s(0), s(0)) $\leftarrow$ m(0, s(0), 0), a(0,0,0), a(0,s(0),s(0)),
a(s(0),0,s(0))), which returns answer ‘subsumed.’ An application of Reduce
returns the clause m(s(0), s(0), s(0)) $\leftarrow$ m(0, s(0), 0), a(s(0), 0, s(0)).
The lgg of this clause and $C_3$ is
$C_5$: m(s(U), s(V), s(W)) $\leftarrow$ m(U, s(V), Z1), a(s(V), Z1, s(W)).
The query $SUBSUME(C_5)$ returns answer ‘yes’ and clause $C_3$ in $H$ is replaced
by $C_5$.

The outer while loop asks $EQUIV(H)$ and gets a counterexample, say
m(s(s(s(0))), 0, 0). The inner while loop asks
$REQ$(m(s(s(s(0))), 0, 0) $\leftarrow$ m(0, 0, 0), a(0, 0, 0)), which returns a hint
m(s(0), 0, 0). Then the inner while loop asks
$REQ$(m(s(0), 0, 0) $\leftarrow$ m(0, 0, 0), a(0, 0, 0)), which returns answer
‘subsumed’ and Reduce does not delete any literal but returns the same clause.
The lgg of this clause and $C_5$ is
$C_6$: m(s(X), Y, Z) $\leftarrow$ m(X, Y, Z1), a(Y, Z1, Z). The query
$SUBSUME(C_6)$ returns answer ‘yes’ and clause $C_5$ in $H$ is replaced by $C_6$.
The algorithm terminates as the query $EQUIV(H)$ returns answer ‘yes’ and the
final program learnt is the following.

m(0, Y, 0) $\leftarrow$

m(s(X), Y, Z) $\leftarrow$ m(X, Y, Z1), a(Y, Z1, Z) □

6 Correctness of the Learning Algorithm 

First we prove that the oracle can answer all three types of queries in
polynomial time.

Lemma 5 The query SUBSUME{C) can be answered in polynomial time over 
the size of the target program and the size of C. 

Proof : Follows from Lemma 3. □

Lemma 6 The query $EQUIV(H)$ can be answered in polynomial time over the
size of the target program and the size of $H$.

Proof : This can be done by checking that each clause in $H$ is subsumed by a
clause in the target program $H^*$ and each clause in $H^*$ is subsumed by a clause
in $H$. Each such subsumption check can be done in polynomial time and hence
$EQUIV(H)$ can be answered in polynomial time. □

Before proving that the query $REQ(A \leftarrow Closure_H(A))$ can be answered in
polynomial time over the size of the target program and the size of $A$, we prove
that the clause $A \leftarrow Closure_H(A)$ can be constructed in polynomial time
over the size of $A$.

Lemma 7 If $A$ is a ground atom then $|Closure_H(A)|$ is bounded by a polynomial
over $|A|$.

Proof : Let $n$ be the size of $A$. By definition, each input term of an atom
$p(\cdots) \in Closure_H(A)$ with $p \in \Pi_1$ is a subterm of an input term of
$A$. The number of subterms of a term is bounded by its size. Since $H$ is a
deterministic program, the sequence of input terms of an atom uniquely
determines the output term. Therefore, the number of atoms
$p(\cdots) \in Closure_H(A)$ with $p \in \Pi_1$ is bounded by $k_1 n^{k_2}$, a
polynomial in $n$. By definition, the output term of an atom
$p(\cdots) \in Closure_H(A)$ with $p \in \Pi_0$ is a subterm of the output term of
$A$. Since the background knowledge $B$ is regular, the number of atoms $p(u; v)$
such that $B \models p(u; v)$ is bounded by a polynomial over the size of $v$.
Therefore, $|Closure_H(A)|$ is bounded by a polynomial over the size of $A$. □

Lemma 8 Let $A$ be either a positive example returned by an equivalence query
or a hint returned by a request-for-hint query. The query
$REQ(A \leftarrow Closure_H(A))$ can be answered in polynomial time over the size
of the target program and the size of $A$.

Proof : It is easy to see that $H^* \models A$ if $A$ is a positive example or $A$
is a hint returned by a request-for-hint query. Therefore, there is an adapted
SLD-refutation of $H^* \cup \{\leftarrow A\}$ with answer substitution $\theta$ such
that each atom $p(u; v)$ in it satisfies (a) $p \in \Pi_1$ and each term in $u$ is
a subterm of an input term of $A$ or (b) $p \in \Pi_0$ and $v\theta \in DB(t)$. Let
$S$ be the set of all atoms of the form $p(u; v)$ such that $H^* \models p(u; v)$
and either (1) $p \in \Pi_1$ and each term in $u$ is a subterm of an input term of
$A$ or (2) $p \in \Pi_0$ and $v \in DB(t)$. Note that each atom in $S$ need not be
there in any adapted SLD-refutation of $H^* \cup \{\leftarrow A\}$, though the
converse holds. Since $|DB(t)|$ is bounded by a polynomial over $|t|$, it follows
that $|S|$ is bounded by a polynomial over the size of $A$ and we can compute $S$
in polynomial time over the size of $A$ in a bottom-up fashion. Now, we can
construct an adapted SLD-refutation of $H^* \cup \{\leftarrow A\}$ in polynomial
time by resolving each selected atom with a ground instance of a clause whose
body-atoms are all members of $S$. □



Remark: It may be noted that the main idea of the above proof, namely,
bottom-up construction of potential atoms in an SLD-refutation, can be used for
proving Lemma 5 of Arimura [3]. That lemma is very important for the results of
[3], but the proof given there is wrong in the following respect. In their lemma,
an attempt is made to prove that the entailment problem $H \models C$ is
polynomial-time solvable for acyclic constrained Horn programs. They Skolemize
$C = A \leftarrow Body$ by applying a Skolem substitution $\sigma$, construct the
set $ground_\sigma(H)$ of all the ground clauses obtained from $H$ by substituting
for the variables in $H$ arbitrary subterms of the head $A\sigma$ and check
whether $ground_\sigma(H) \models C\sigma$ or not. They incorrectly claim that
$size(ground_\sigma(H))$ is bounded by a polynomial in $size(H)$ and $size(C)$.
The upper bound on the number of variables in $H$ is a linear function over
$size(H)$ (it is easy to see that a term of size $n$ can have $n - 1$ variables).
The number of subterms of the head $A\sigma$ is bounded by $size(A\sigma)$, which
is again a linear function over $size(C)$. Each variable can be substituted by any
of these subterms. Therefore, $size(ground_\sigma(H))$ is bounded by
$size(C)^{size(H)}$, which is not a polynomial in $size(H)$ but an exponential in
$size(H)$.
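The growth described in the remark can be made concrete with a small enumeration. This is only a counting sketch: the function name `ground_substitutions`, the variable names and the three sample subterms are assumptions, and no clauses are actually instantiated.

```python
from itertools import product

def ground_substitutions(variables, candidate_subterms):
    """Enumerate every way of assigning one candidate subterm to each variable."""
    for values in product(candidate_subterms, repeat=len(variables)):
        yield dict(zip(variables, values))

# Three candidate subterms, as if taken from a Skolemized head.
candidates = ["0", "s(0)", "s(s(0))"]

counts = []
for v in range(1, 5):
    variables = ["X%d" % i for i in range(v)]
    counts.append(sum(1 for _ in ground_substitutions(variables, candidates)))
```

With 3 candidate subterms, the number of substitutions for 1, 2, 3 and 4 variables is 3, 9, 27 and 81. The count is the number of subterms raised to the number of variables, so it is exponential in the number of variables, which can itself grow linearly with the size of the program.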

We now proceed to the correctness proof of Learn-FM. Let $H^*$ be the
target program, $H_0, H_1, \ldots$ be the sequence of hypotheses proposed in the
equivalence queries and $A_1, A_2, \ldots$ be the sequence of counterexamples
returned by those queries.

Theorem 5 For each $i \geq 0$, hypothesis $H_i$ is a conservative refinement of
$H^*$ and counterexample $A_{i+1}$ is positive.

Proof : Proof by induction on $i$. For $i = 0$, $H_0$ is $B$ and the theorem
obviously holds. We prove the theorem holds for $i = m$ if it holds for
$i = m - 1$. Consider the $m$-th iteration of the main while loop. Since $A$ is a
positive counterexample, $H^* \models A$ and hence
$H^* \models A \leftarrow Closure_{H_{m-1}}(A)$. Each hint $B$ is an atom in an
adapted SLD-refutation of $H' \cup \{\leftarrow A\}$ where
$H' = H^* \cup \{B' \leftarrow \mid B' \in Closure_{H_{m-1}}(A)\}$. By induction
hypothesis, $H_{m-1}$ is a conservative refinement of $H^*$ and hence
$H^* \models B'$ for each $B' \in Closure_{H_{m-1}}(A)$. Therefore each hint $B$ is
an atom in an adapted SLD-refutation of $H^* \cup \{\leftarrow A\}$ and
$H^* \models B \leftarrow Closure_H(B)$. By definition, $H \not\models B$ for any
hint $B$. That is, $H^* \models C$ and $H_{m-1} \not\models C$ for each clause $C$
considered in this iteration, in particular for the clause $C$ at the exit of the
inner while loop. We have two cases: (a) there is a clause $D \in H_{m-1}$ such
that $C \sqcup D$ is subsumed by a clause $C^* \in H^*$ and
$H_m = (H_{m-1} \cup \{Reduce(C \sqcup D)\}) - \{D\}$ or (b) there is no such
clause $D$ and $H_m = H_{m-1} \cup \{C\}$.

By hypothesis, $H_{m-1}$ is a conservative refinement of $H^*$ and it is easy to
see that $H_m$ is a conservative refinement of $H^*$ in case (b). Consider case
(a) now. Since $H_{m-1}$ is a conservative refinement of $H^*$, $D$ is the unique
clause in $H_{m-1}$ subsumed by $C^*$. As $H_m$ is obtained from $H_{m-1}$ by
replacing $D$ with $Reduce(C \sqcup D)$, it is clear that $H_m$ is a conservative
refinement of $H^*$. Since each hypothesis is a refinement, each counterexample
is positive. □

Now, we establish polynomial time complexity of the learning algorithm
Learn-FM.

Lemma 9 If $C$ is a clause of size $n$, then the sequence
$C = C_0 \prec C_1 \prec C_2 \prec \cdots$ is of length no more than $2n$.

Proof : When $C_i \prec C_{i+1}$, one of the following holds: (1)
$size(C_{i+1}) = size(C_i)$ and $|Var(C_{i+1})| > |Var(C_i)|$, i.e., a constant or
an occurrence of a variable (which occurs in $C_i$ more than once) is replaced by
a new variable or (2) $size(C_{i+1}) < size(C_i)$. The change (1) can occur at
most $n$ times as the number of variables in a clause is less than its size. The
change (2) can occur at most $n$ times as the size of any clause is positive. □

Theorem 6 For any counterexample $A$ of size $n$, the inner while loop of
Learn-FM iterates for no more than $k_1 n^{k_2}$ (a polynomial in $n$) times.

Proof : Since each hint $B$ is an atom in an adapted SLD-refutation of
$H^* \cup \{\leftarrow A\}$,
the input terms of B are subterms of the input terms of A by Theorem ^ 
There are at most $k_1 n^{k_2}$ such atoms. As the target program $H^*$ is a
terminating program, no atom $B$ is returned as hint more than once. Therefore
the inner while loop iterates for no more than $k_1 n^{k_2}$ times. □

Theorem 7 The algorithm Learn-FM exactly identifies any finely-moded program
with $m$ clauses in polynomial time over $m$ and $n$, where $n$ is the size of
the largest counterexample provided.

Proof : The termination condition of the main while loop is $H \equiv H^*$.
Therefore Learn-FM exactly identifies the target program $H^*$ if Learn-FM
terminates. Now, we prove that the number of iterations of the main while loop is
bounded by a polynomial over $m$ and $n$.

By Theorem 5, $H$ is always a conservative refinement of $H^*$ and hence $H$
has at most $m$ clauses. The size of each clause in $H$ is bounded by a polynomial
in $n$ by Lemma 7. Each iteration of the main while loop either adds a clause to
$H$ or generalizes a clause in $H$. By Lemma 9, the number of times a clause can
be generalized is bounded by twice the size of the clause. Therefore, the number
of iterations of the main while loop is bounded by $m \cdot poly(n)$, where
$poly(n)$ is a polynomial in $n$. Each iteration takes polynomial time as (1)
saturation and lgg are polynomial time computable, (2) each query is answered in
polynomial time and (3) by Theorem 6, the number of iterations of the inner while
loop is bounded by a polynomial in $n$. Therefore, Learn-FM exactly identifies
any finely-moded program with $m$ clauses in polynomial time over $m$ and $n$. □



7 Discussion 

In this paper, we studied exact learning of logic programs from entailment and
presented a polynomial time algorithm to learn a rich class of logic programs
that allow local variables and include many standard programs from Sterling
and Shapiro’s book [18].

The following theorem establishes that the class of acyclic constrained Horn
programs is properly contained in the class of finely-moded programs and our
main result is a generalization of the main result of [3].

Theorem 8 Every acyclic constrained Horn program is a terminating finely- 
moded program with respect to the moding in which every predicate has no 
output position. 

Proof : By acyclicity, every acyclic constrained Horn program is a terminating
program. Now we prove that any acyclic constrained Horn clause
$p_0(s_0) \leftarrow p_1(s_1), \ldots, p_n(s_n)$ is finely-moded. Only condition
(2) of the definition of finely-moded clauses is relevant as no predicate has any
output position. Condition (2) requires that each term in $s_1, \ldots, s_n$ is a
subterm of a term in $s_0$ and this is true of any acyclic constrained Horn
clause. □

As in 12, we can replace request-for-hint queries with membership queries to 
learn a class of finely-moded programs whose termination can be proved using 
a particular well-founded ordering. 

Reddy and Tadepalli [14] independently studied exact learning of logic programs
with local variables from entailment and introduced the class of acyclic
Horn (AH) programs. The main restriction on AH-programs is that each term
occurring in the head of a clause is a subterm of a term in the body. This is a
strong restriction from the programming point of view and excludes even simple
programs like append and member. However, Reddy and Tadepalli [14] argue that
the class of acyclic Horn (AH) programs is quite useful for representing planning
knowledge. Further, they do not need moding annotations.

Acknowledgements: The authors wish to acknowledge the financial support 
from the Australian Research Council (ARC) under the ARC Large Grant 
Scheme (No. A49601783). 






References 

1. D. Angluin (1988), Learning with hints, Proc. COLT’88, pp. 223-237. 

2. D. Angluin (1988), Queries and concept learning, Machine Learning 2, pp. 
319-342. 

3. H. Arimura (1997), Learning acyclic first-order Horn sentences from
entailment, Proc. ALT’97, Lecture Notes in Artificial Intelligence 1316, pp.
432-445.

4. W. Cohen and H. Hirsh (1992), Learnability of description logics, Proc.
COLT’92, pp. 116-127.

5. S. Dzeroski, S. Muggleton and S. Russell (1992), PAC-learnability of
determinate logic programs, Proc. of COLT’92, pp. 128-135.

6. M. Frazier and L. Pitt (1993), Learning from entailment: an application to
propositional Horn sentences, Proc. ICML’93, pp. 120-127.

7. M. Frazier and L. Pitt (1994), CLASSIC learning, Proc. COLT’94, pp. 23-34.

8. P. Idestam-Almquist (1996), Efficient induction of recursive definitions by
structural analysis of saturations, pp. 192-205 in L. De Raedt (ed.), Advances
in Inductive Logic Programming, IOS Press.

9. M.R.K. Krishna Rao (1995), Incremental learning of logic programs, Proc.
of Algorithmic Learning Theory, ALT’95, LNCS 997, pp. 95-109. Revised
version in Theoretical Computer Science special issue on ALT’95, Vol. 185,
pp. 193-213.

10. J. W. Lloyd (1987), Foundations of Logic Programming, Springer-Verlag.

11. S. Muggleton and L. De Raedt (1994), Inductive logic programming: theory
and methods, J. Logic Prog. 19/20, pp. 629-679.

12. S.H. Nienhuys-Cheng and R. de Wolf (1995), The subsumption theorem for
several forms of resolution, Tech. Rep. EUR-FEW-CS-96-14, Erasmus
University, Rotterdam.

13. C.D. Page and A.M. Frisch (1992), Generalization and learnability: a study
of constrained atoms, in Muggleton (ed.), Inductive Logic Programming,
pp. 29-61.

14. C. Reddy and P. Tadepalli (1998), Learning first order acyclic Horn
programs from entailment, to appear in Proc. of International Conference on
Machine Learning, ICML’98.

15. C. Rouveirol (1992), Extensions of inversion of resolution applied to theory
completion, in Muggleton (ed.), Inductive Logic Programming, pp. 63-92.

16. E. Shapiro (1981), Inductive inference of theories from facts, Tech. Rep.,
Yale Univ.

17. E. Shapiro (1983), Algorithmic Program Debugging, MIT Press. 

18. L. Sterling and E. Shapiro (1994), The Art of Prolog, MIT Press. 




Logical Aspects of Several Bottom-Up Fittings 



Akihiro Yamamoto 

Division of Electronics and Information Engineering 
and 

Meme Media Laboratory 
Hokkaido University 
N 13 W 8, Kita-ku 
Sapporo 060-8628 JAPAN 
yamamoto@meme.hokudai.ac.jp



Abstract. This research is aimed at giving a bridge between the two
research areas, Inductive Logic Programming and Computational Learning.
We focus our attention on four fittings (learning methods) invented
in the two areas: Saturant Generalization, V*-operation with
Generalization, Bottom Generalization, and Inverse Entailment. Firstly we show
that each of them can be represented as an instance of a common schema.
Secondly we compare the four fittings. By modifying Jung’s result, we
show that all definite hypotheses derived by V*-operation with
Generalization can be derived by Bottom Generalization and vice versa, but
that some hypotheses cannot be derived by Saturant Generalization. We
also give a hypothesis in the form of a general clause which can be derived
by Bottom Generalization but not by V*-operation with Generalization. We
show that Inverse Entailment is more powerful than the other three fittings
both in definite and in general clausal logic. In our papers presented at the
IJCAI’97 workshops and the 7th ILP workshop, Bottom Generalization was called
“Inverse Entailment,” but after the workshops we found that it differs from
Muggleton’s original Inverse Entailment. We renamed it “Bottom
Generalization” in order to reduce confusion and allow fair comparison of the
fitting to others.



1 Introduction 

This research is aimed at giving a bridge between the two research areas,
Inductive Logic Programming and Computational Learning.

We are giving logical foundations of systems which learn clausal theories
from entailment. Such a system M works as follows: At first M is given an
initial background theory, and takes many positive examples as its input one by
one. The background theory is a clausal theory (a conjunction of clauses) and
each positive example is a clause. The system revises its current background
theory $B$ whenever $B \not\models E$ for a given example $E$. One of the revision
operations is to add a clause called a hypothesis to $B$. A hypothesis $H$ is said
to be correct if $B \wedge H \models E$. A method to find correct hypotheses is
called a fitting.

In this paper we clarify the relationship among four fittings developed in
the area of ILP (Inductive Logic Programming): Saturant Generalization,
V*-operation with Generalization, Bottom Generalization, and Inverse Entailment.
Firstly we show that each of these can be represented as an instance of a common
schema for various fitting procedures. According to this result, we can compare
the fittings by comparing the sets which consist of correct hypotheses generated
by each fitting.

Saturant Generalization was developed by Rouveirol, in the ILP community, in
order to find hypotheses which cannot be found with Muggleton’s V-operator. She
assumed that every background theory should be a definite logic program, and
that every example should be a definite clause. Saturant Generalization can
generate only definite clauses as hypotheses. The V*-operation was proposed by
Muggleton in his invited talk at the first ALT workshop. He presented a fitting
by combining it with the inverse of subsumption, and Jung showed that the
fitting is an extension of Saturant Generalization. Muggleton defined the
fitting without the assumption for Saturant Generalization, and allowed
non-definite clauses to be derived as hypotheses. Later he proposed another
fitting, Inverse Entailment, which is again an extension of Saturant
Generalization. While we were analyzing Inverse Entailment, we happened to find a
new fitting, Bottom Generalization. It can be regarded as another extension of
Saturant Generalization. Bottom Generalization as well as Inverse Entailment
allows non-definite clauses to appear in background theories, examples, and
hypotheses.

In the COLT community, Angluin et al. invented Saturant Generalization
in their algorithm which learns propositional definite programs from entailment
efficiently. Since they did not use the word “saturant,” we conjecture
that they invented Saturant Generalization without referring to Rouveirol’s work.
Arimura [2] developed an algorithm learning first-order definite clauses, based
on his analysis of Angluin’s algorithm. He presented the algorithm at the last
ALT workshop. Cohen analyzed the complexity of Saturant Generalization
in the first-order definite clause logic, by using the PAC-learning framework.

Now we must notice that a quite interesting problem has been left unsolved: Can
we develop any algorithm which learns non-definite clausal theories from
entailment, by using the three extensions of Saturant Generalization? In this
paper we give the first step towards a solution to this problem. The result that
the four fittings can be regarded as instances of a common schema means that we
may use the three fittings as well as Saturant Generalization for learning from
entailment. The comparison of the four shows, from a logical viewpoint, how the
three extensions are more complicated than Saturant Generalization. This
information should be helpful when we design the intended learning algorithms.

This paper is organized as follows: After preparing some terminology in logic
in Section 2, we show in Section 3 that each of the four fittings can be
represented as an instance of a procedure BU-FIT-MAIN. In Section 4 we compare
each of them with the others, referring to some previous results given by Jung
and by ourselves. In Section 5, we give some additional comments on our
discussion.

In our papers presented at the IJCAI’97 workshops and the 7th ILP workshop,
our fitting was called “Inverse Entailment,” but after the workshops we
found that it differs from Muggleton’s original Inverse Entailment. We renamed it
“Bottom Generalization” in order to reduce confusion and allow fair comparison
of our fitting to others.

2 Preliminaries 

We assume that readers are familiar with first-order logic and logic programming.
When more precise definitions are needed, they should consult textbooks on
these areas.

Let $\mathcal{L}$ be a first-order language. For each variable $x$ we prepare a new
constant symbol $c_x$ called the Skolem constant of $x$. We let $\mathcal{L}^s$
denote the language whose alphabet is obtained by adding all the Skolem constants
to that of $\mathcal{L}$.

A clausal theory is a finite conjunction of clauses. In this paper a clause is a
formula of the form

$C = \forall x_1 \ldots x_k (A_1 \vee A_2 \vee \ldots \vee A_n \vee \neg B_1 \vee \neg B_2 \vee \ldots \vee \neg B_m)$

where $n \geq 0$, $m \geq 0$, the $A_i$'s and $B_j$'s are all atoms, and
$x_1, \ldots, x_k$ are all variables occurring in the atoms. We represent the
clause $C$ in the form of implication:

$A_1, A_2, \ldots, A_n \leftarrow B_1, B_2, \ldots, B_m$.

The complement of $C$ is a clausal theory

$\neg(C\sigma_C) = (\neg A_1 \wedge \neg A_2 \wedge \ldots \wedge \neg A_n \wedge B_1 \wedge B_2 \wedge \ldots \wedge B_m)\sigma_C$

where $\sigma_C$ is a substitution which replaces each variable in $C$ with its
Skolem constant. We will write the substitution as $\sigma$ if this causes no
ambiguity.

A definite clause is a clause of the form $A_0 \leftarrow A_1, \ldots, A_n$. A
clausal theory consisting of definite clauses is called a definite program.

Definition 1. A clause $D$ subsumes a clause $C$ if there is a substitution
$\theta$ such that every literal in $D\theta$ occurs in $C$.

We apply this definition to the case when $C$ is a (possibly infinite) set of
literals. Note that $D$ subsumes $C$ iff there are a clause $F$ and a substitution
$\theta$ such that every literal in $F$ belongs to $C$ and $D\theta = F$. Therefore
to make a clause $D$ subsuming $C$ is to apply the inverse of instantiation to a
clause made of some literals in $C$.

For a first-order language $\mathcal{L}$, we introduce some notations:
$GL(\mathcal{L})$ for the set of all ground literals in $\mathcal{L}$,
$CT(\mathcal{L})$ for the set of all clausal theories in $\mathcal{L}$,
$C(\mathcal{L})$ for the set of all clauses in $\mathcal{L}$, $DP(\mathcal{L})$ for
the set of all definite programs in $\mathcal{L}$, and $D(\mathcal{L})$ for the set
of all definite clauses in $\mathcal{L}$. For a set $S$ of literals, $S^+$ ($S^-$)
denotes the set of all positive (negative, resp.) literals in $S$. The set
$GL(\mathcal{L})^+$, which is the Herbrand base of $\mathcal{L}$, is denoted by
$HB(\mathcal{L})$.

3 Bottom-Up Fittings

In the theory of ILP three languages should be distinguished: the language
$\mathcal{L}_B$ for background knowledge, the observational language
$\mathcal{L}_E$, and the hypothesis language $\mathcal{L}_H$. Each element of
$\mathcal{L}_B$ is called a background theory, each of $\mathcal{L}_E$ is called an
example, and each of $\mathcal{L}_H$ is called a hypothesis. The tuple
$\mathcal{S} = (\mathcal{L}_B, \mathcal{L}_E, \mathcal{L}_H)$ is called a language
structure for an ILP theory. The language structure
$\mathcal{S}_G = (CT(\mathcal{L}), C(\mathcal{L}), C(\mathcal{L}))$ is called the
general structure of $\mathcal{L}$, and the structure
$\mathcal{S}_D = (DP(\mathcal{L}), D(\mathcal{L}), D(\mathcal{L}))$ is the definite
structure of $\mathcal{L}$.

Definition 2. Let $\mathcal{S}$ be a language structure. A fitting procedure (or a fitting,
for short) $\mathcal{F}$ is a procedure which generates hypotheses $H$ from a given example
$E$ with the support of a background theory $B$. The set of all such hypotheses is
denoted by $\mathcal{F}(E, B)$.

The fittings we discuss here can be represented as one main routine and
two sub-procedures. The first sub-procedure derives a highly specific clause and
the second generalizes it. We give formal definitions.

Definition 3. A base enumerator $\Lambda$ is a procedure which takes an example $E$
and a background theory $B$ as its input and enumerates a subset of $\mathrm{GL}(\mathcal{L}^s)$.
This subset is denoted by $\Lambda(E, B)$ and called a base set.

Definition 4. A generalizer $\Gamma$ takes a ground clause $F$ in $\mathrm{C}(\mathcal{L}^s)$ and generates
clauses in $\mathcal{L}_H$. The set of all clauses generated by $\Gamma$ is denoted by $\Gamma(F)$.



Procedure BU-FIT-MAIN$_{\Lambda,\Gamma}(E, B)$

1. Choose non-deterministically literals $L_1, \ldots, L_n$ from $\Lambda(E, B)$.

2. Return non-deterministically clauses in $\Gamma(L_1 \vee \ldots \vee L_n)$.

If at least one of the sets $\Lambda(E, B)$ and $\Gamma(F)$ is infinite, we must adopt some
dovetailing method in order to enumerate all elements of the sets. In our discussion
we need not be concerned with how the dovetailing method is implemented.
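
The dovetailing mentioned above can be sketched as follows. This is only an illustration under our own assumptions: the name `dovetail` and the callback `make_stream` are hypothetical, and stream $i$ stands for the enumeration $\Gamma(F_i)$ of the $i$-th clause $F_i$ built from literals of the base set.

```python
from itertools import count

def dovetail(make_stream):
    """Enumerate pairs (i, item) drawn from make_stream(0), make_stream(1), ...

    Each stream may be infinite. In round r we open stream r and then take one
    further item from every stream opened so far, so every item of every
    stream is reached after finitely many steps.
    """
    streams = []
    for r in count():
        streams.append(iter(make_stream(r)))
        for i, stream in enumerate(streams):
            try:
                yield i, next(stream)
            except StopIteration:
                pass  # stream i is finite and already exhausted
```

With two infinite families of outputs, no single infinite stream can starve the others, which is all that the procedure BU-FIT-MAIN requires of the enumeration order.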

The following are the fittings whose logical aspects we analyze in this paper.



3.1 Saturant Generalization 

We define Saturant Generalization as an instance of BU-FIT-MAIN, which is
denoted by SATG.

First, for a definite clause $C = A \leftarrow B_1, B_2, \ldots, B_n$, we define clauses $C^+$
and $C^-$ as follows:

$$C^+ = A \leftarrow\,, \qquad C^- = {} \leftarrow B_1, B_2, \ldots, B_n .$$

Definition 5. The saturant of a definite clause $E$ w.r.t. a definite program $B$
is defined as

$$\mathrm{Satu}(E, B) = \{E^+\sigma\} \cup \{\neg A \mid A \in \mathrm{HB}(\mathcal{L}) \text{ and } B \wedge \neg(E^-\sigma) \models A\} .$$

The generalizer for SATG is the inverse of instantiation. We formally define
the generalizer in the form of a set

$$\mathrm{IIns}(K) = \{C \mid C\theta = K \text{ for some substitution } \theta\} .$$

A clause $C$ in $\mathrm{IIns}(K)$ is usually called a generalization of $K$.

We put

$$\mathrm{SATG} = \text{BU-FIT-MAIN}_{\mathrm{Satu},\mathrm{IIns}} .$$

The set of the hypotheses generated by SATG is represented as

$$\mathrm{SATG}(E, B) = \{H \mid H \text{ subsumes } \mathrm{Satu}(E, B)\} .$$



3.2 V*-Operation with Generalization 

As proposed by Muggleton and Jung, the repeated use of Muggleton’s most
specific V-operation (MSV-operation, for short) can be regarded as generating a
base set.

The original definition of the MSV-operator is given in an operational form and
is not convenient for discussing the derivability of hypotheses. Focusing on
the results of the operator, we give a new definition suited to our discussion.

Definition 6. Let $D$ and $E$ be clauses. A clause $C$ is V-derivable from $E$ with
the support of $D$ if there are a substitution $\theta$ and a literal $L$ such that

1. $L$ is in a factor $D'$ of $D$,

2. $C = E \vee \neg L\theta$, and

3. every literal in $(D' - \{L\})\theta$ occurs in $E$.

If $D$ is a ground instance of a clause in a set of clauses $B$, we say $C$ is V-derivable from $E$
with the support of $B$.

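Next we inductively define a set $V^n(E, B)$ for every non-negative integer $n$:

$$V^0(E, B) = \{E\sigma\} ,$$

$$V^{n+1}(E, B) = \{H \mid H \text{ is V-derivable from a clause } F \in V^n(E, B) \text{ with the support of } B\} \quad (n \ge 0) .$$

By using the sets $V^n(E, B)$ ($n = 0, 1, 2, \ldots$), we define a base set

$$\mathrm{Vbot}(E, B) = \{L \in \mathrm{GL}(\mathcal{L}^s) \mid L \text{ occurs in some } H \in V^n(E, B) \text{ for some } n\}$$

and a fitting

$$\mathrm{VNG} = \text{BU-FIT-MAIN}_{\mathrm{Vbot},\mathrm{IIns}} .$$

From the definition it holds that

$$\mathrm{VNG}(E, B) = \{H \mid H \text{ subsumes } \mathrm{Vbot}(E, B)\} .$$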

3.3 Bottom Generalization

The base set of Bottom Generalization is the bottom set $\mathrm{Bot}(E, B)$, which was
originally introduced by Muggleton in an informal manner. We give its
formal definition.

Definition 7. Let $B$ be a background theory and $E$ be an example. The bottom
set (or bottom, for short) for $E$ w.r.t. $B$ is the set of literals

$$\mathrm{Bot}(E, B) = \{L \in \mathrm{GL}(\mathcal{L}^s) \mid B \wedge \neg(E\sigma) \models \neg L\} .$$

We adopt the inverse of subsumption as the generalizer for BOTG, that is,
we define

$$\mathrm{BOTG} = \text{BU-FIT-MAIN}_{\mathrm{Bot},\mathrm{IIns}} ,$$

and therefore,

$$\mathrm{BOTG}(E, B) = \{H \in \mathrm{C}(\mathcal{L}) \mid H \text{ subsumes } \mathrm{Bot}(E, B)\} .$$

By the note on the subsumption relation in Section 2, the generalizer can be
a procedure for inverting instantiation.



3.4 Inverse Entailment 

A fitting INVE for Inverse Entailment is defined as

$$\mathrm{INVE} = \text{BU-FIT-MAIN}_{\mathrm{Bot},\mathrm{IImp}} ,$$

where IImp is a procedure for inverting logical implication of clauses.
The set of the hypotheses derived by INVE is represented as

$$\mathrm{INVE}(E, B) = \{H \in \mathrm{C}(\mathcal{L}) \mid \text{there is a clause } K \text{ such that } H \models K \text{ and } K \text{ consists of literals in } \mathrm{Bot}(E, B)\} .$$

The set

$$\mathrm{IImp}(F) = \{H \in \mathrm{C}(\mathcal{L}) \mid H \models F\}$$

can be recursively enumerated with some generate-and-test method. We are
not concerned with how efficiently the elements of the set are generated.



4 Logical Aspects of the Fittings 

4.1 Correctness and Completeness of Fittings 

We give the formal definition of the correctness of hypotheses generated by a
fitting, following our previous work.

Definition 8. Let B be a background theory and E an example. A hypothesis 
H is correct for E w.r.t. B if B A H is consistent and B A H \= E. 

Definition 9. A fitting procedure $\mathcal{F}$ is correct in a language structure $\mathcal{S} =
(\mathcal{L}_B, \mathcal{L}_E, \mathcal{L}_H)$ if $H \in \mathcal{F}(E, B)$ implies $B \wedge H \models E$ for any pair of $B \in \mathcal{L}_B$ and
$E \in \mathcal{L}_E$.

Generally speaking, the inference power of a fitting $\mathcal{F}$ is analyzed in two ways.
The first is to find a structure $\mathcal{S}$ in which every $H$ such that $B \wedge H \models E$
must be in $\mathcal{F}(E, B)$. The second is to find a relation among $B$, $E$, and
$H$ which is equivalent to $H \in \mathcal{F}(E, B)$ in the structure $\mathcal{S}$. This means that we
show the completeness of $\mathcal{F}$ relative to the relation.

According to the second approach, we define the completeness of fittings 
relative to a generalization relation. 

Definition 10. Let $\mathcal{B}$ and $\mathcal{H}$ be sets of formulas. A ternary relation $\succeq\ \subseteq \mathcal{B} \times \mathcal{H} \times \mathcal{H}$
is a generalization relation on $\mathcal{H}$ parameterized with $\mathcal{B}$ if $\succeq(B, H, E)$ implies
$B \wedge H \models E$. In the following we write $H \succeq E\ (B)$ instead of $\succeq(B, H, E)$.

Definition 11. A fitting $\mathcal{F}$ is complete in a language structure $\mathcal{S}$ w.r.t. a generalization
relation $\succeq$ if $\mathcal{F}$ is correct and every hypothesis $H$ such that $H \succeq E\ (B)$
can be derived from $E$ w.r.t. $B$ with $\mathcal{F}$ whenever $B \not\models E$.

From the definition, the relation $\succeq_I$ defined as

$$H \succeq_I E\ (B) \iff B \wedge H \models E$$

is a generalization relation. We call it the relative implication relation.

Both Plotkin’s relative subsumption and Buntine’s generalized subsumption
are also generalization relations.

Definition 12 (Plotkin). Let $H$ and $E$ be two clauses. $H$ subsumes $E$
relative to $B$ if there is a clause $F$ such that

$$B \models \forall y_1 \ldots y_n (E' \leftrightarrow F')$$

and $H$ subsumes $F$, where $E'$ and $F'$ are obtained by removing universal quantifiers
from $E$ and $F$ respectively, and $y_1, \ldots, y_n$ are all variables occurring in
$E'$ and $F'$.

Definition 13. Let $A$ be a ground atom and $I$ be an Herbrand interpretation.
A definite clause $A_0 \leftarrow A_1, \ldots, A_n$ covers $A$ in $I$ if there is a substitution $\theta$ such
that $A_0\theta = A$ and $A_i\theta$ is true in $I$ for every $i = 1, \ldots, n$.

Definition 14 (Buntine [3]). Let $H$ and $E$ be two definite clauses. $H$ subsumes
$E$ w.r.t. $B$ if, for any Herbrand model $M$ of $B$ and for any ground atom $A$,
$H$ covers $A$ in $M$ whenever $E$ covers $A$.

When $H$ subsumes $E$ relative to (w.r.t.) $B$, we write $H \succeq_P E\ (B)$ ($H \succeq_B E\ (B)$,
resp.). To present the main result visually, we introduce the following two
sets of hypotheses:

$$\succeq_P(E, B) = \{H \mid H \succeq_P E\ (B)\} ,$$
$$\succeq_B(E, B) = \{H \mid H \succeq_B E\ (B)\} .$$

4.2 Results

Now we demonstrate the current analysis of correctness and completeness of the 
fittings explained in Section 3. 

Theorem 1 (Correctness). The fitting SATG is correct in $\mathcal{S}_D$, and VNG,
INVE, and BOTG are all correct in $\mathcal{S}_G$.

Theorem 2 (Completeness). Let us assume the structure $\mathcal{S}_D$. Then, for every
background theory $B$ and every example $E$ such that $B \not\models E$, the following holds
in general:

$$\mathrm{SATG}(E, B) = {\succeq_B}(E, B) \subsetneq \mathrm{VNG}(E, B) = \mathrm{BOTG}(E, B) = {\succeq_P}(E, B) \subsetneq \mathrm{INVE}(E, B) .$$

If $\mathcal{S}_G$ is assumed, the following holds in general for every $B$ and every $E$ such
that $B \not\models E$:

$$\mathrm{VNG}(E, B) \subsetneq \mathrm{BOTG}(E, B) = {\succeq_P}(E, B) \subsetneq \mathrm{INVE}(E, B) .$$



[1] Proof of the Correctness Theorem 

Since we will prove the completeness theorem, all that we must show is the next 
lemma. 

Lemma 1. The procedure INVE is correct in the general structure $\mathcal{S}_G$.

Proof. Let $H$ be a hypothesis in $\mathrm{INVE}(E, B)$ and

$$F = A_1, A_2, \ldots, A_n \leftarrow B_1, B_2, \ldots, B_m$$

be a clause such that every literal in $F$ belongs to $\mathrm{Bot}(E, B)$ and $H \models F$.

Since the clause $F$ is ground, no quantifiers occur in $\neg F$, that is,

$$\neg F = \neg A_1 \wedge \neg A_2 \wedge \ldots \wedge \neg A_n \wedge B_1 \wedge B_2 \wedge \ldots \wedge B_m .$$

From the definition of the bottom set, it holds that $B \wedge \neg(E\sigma) \models \neg F$. Since all
the literals in $\neg(E\sigma)$ and $\neg F$ are ground, this is equivalent to the assertion that
$B \wedge F \models E\sigma$. Because $H \models F$ and $\sigma$ is the Skolemizing substitution for $E$, we
obtain that $B \wedge H \models E$. □

[2] Proof of the Completeness Theorem 

We divide the Completeness Theorem into several lemmas and give their proofs.

The completeness of Saturant Generalization was proved by Jung, and that
of Bottom Generalization was proved in our previous work.

Lemma 2 (Jung [6]). SATG is complete w.r.t. Buntine’s generalized subsumption
in $\mathcal{S}_D$.

Lemma 3 (Yamamoto [19]). BOTG is complete w.r.t. Plotkin’s relative subsumption
in $\mathcal{S}_G$.

Because $H \succeq_B E\ (B)$ implies $H \succeq_P E\ (B)$ but the converse does not hold in
general [2], we have the next lemma.

Lemma 4. In the definite structure $\mathcal{S}_D$, $\mathrm{BOTG}(E, B) \supseteq \mathrm{SATG}(E, B)$ if $B \not\models E$.
However, $\mathrm{SATG}(E, B) \not\supseteq \mathrm{BOTG}(E, B)$ in general.

We gave a more precise discussion of the comparison of BOTG and
SATG in our previous work.

If a clause $H$ subsumes another clause $E$, it also logically implies $E$. We have already
shown that $H \succeq_I E\ (B)$ is not equivalent to $H \in \mathrm{INVE}(E, B)$ in $\mathcal{S}_G$. By
combining the two results, we obtain the following:

Lemma 5. In the structure $\mathcal{S}_G$, $\mathrm{BOTG}(E, B) \subseteq \mathrm{INVE}(E, B)$ if $B \not\models E$. The
inclusion holds in $\mathcal{S}_D$ as well.

The problem of whether $\mathrm{INVE}(E, B) = \mathrm{BOTG}(E, B)$ had not been
solved. The following example shows that the answer is negative.

Example 1. Let $E_1 = p(s(s(0))) \leftarrow p(0)$ and $B_1 = \emptyset$. The hypothesis

$$H_1 = p(s(x)) \leftarrow p(x)$$

logically implies $E_1$, but does not subsume it. Therefore we conclude that
$\mathrm{INVE}(E_1, B_1) - \mathrm{BOTG}(E_1, B_1) \neq \emptyset$.
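
The claim of Example 1 can be checked mechanically. The following sketch is ours, not the paper's procedure: terms are encoded as nested tuples, variables as plain strings, and `match` and `subsumes` are hypothetical helper names. It assumes the subsumed clause is ground, as $E_1$ is. Since $B_1 = \emptyset$, the bottom set consists of exactly the literals of $E_1$, so subsuming the bottom set coincides with subsuming $E_1$.

```python
def match(pattern, target, theta):
    """Extend theta so that pattern instantiated by theta equals target, else None."""
    if isinstance(pattern, str):                      # pattern is a variable
        if pattern in theta:
            return theta if theta[pattern] == target else None
        return {**theta, pattern: target}
    if isinstance(target, str) or pattern[0] != target[0] or len(pattern) != len(target):
        return None
    for p_arg, t_arg in zip(pattern[1:], target[1:]):
        theta = match(p_arg, t_arg, theta)
        if theta is None:
            return None
    return theta

def subsumes(d, c, theta=None):
    """True if one substitution maps every literal of clause d onto a literal of c."""
    theta = {} if theta is None else theta
    if not d:
        return True
    (sign, atom), rest = d[0], d[1:]
    for c_sign, c_atom in c:
        if c_sign == sign:
            extended = match(atom, c_atom, theta)
            if extended is not None and subsumes(rest, c, extended):
                return True
    return False

zero = ("0",)
s = lambda t: ("s", t)
p = lambda t: ("p", t)

E1 = [(True, p(s(s(zero)))), (False, p(zero))]        # p(s(s(0))) <- p(0)
H1 = [(True, p(s("X"))), (False, p("X"))]             # p(s(x)) <- p(x)
H1_SELF = [(True, p(s(s("X")))), (False, p("X"))]     # resolvent of H1 with itself
```

Here `subsumes(H1, E1)` fails, while the self-resolvent `H1_SELF`, which $H_1$ logically implies, does subsume $E_1$; this is why $H_1$ implies $E_1$ without subsuming it.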

Now we compare VNG with BOTG. As Jung showed, applying the V-operator
iteratively is inverting linear input resolution. With the completeness of linear
input resolution for definite programs, the following two propositions hold.

Lemma 6 (Jung [6]). In $\mathcal{S}_D$, it holds that $\mathrm{VNG}(E, B) = {\succeq_P}(E, B)$ if $B \not\models E$.

It is also obtained that $\mathrm{VNG}(E, B) \subseteq {\succeq_P}(E, B)$ in $\mathcal{S}_G$, but the problem of whether
$\mathrm{VNG}(E, B) = {\succeq_P}(E, B)$ had been open. The following example shows
that $\mathrm{BOTG}(E, B) \neq \mathrm{VNG}(E, B)$ in $\mathcal{S}_G$, and consequently that BOTG is superior
to VNG in that its completeness holds in a wider language structure.

Example 2. Consider the following $E_2$ and $B_2$:

$$E_2 = p \leftarrow\,,$$
$$B_2 = (p, q \leftarrow r) \wedge (p \leftarrow q, r) \wedge (q \leftarrow p, r) \wedge (\leftarrow p, q, r) .$$

If any clause is V-derivable from $E_2$ with the support of a clause in $B_2$, it must
have a factor of the form

$$p, r \leftarrow$$

because every clause in $B_2$ has $\neg r$ but $E_2$ has no $r$ in it. However, we cannot
derive $E_2$ from any pair of such a clause and a clause in $B_2$. Therefore
$\mathrm{Vbot}(E_2, B_2) = \{p\}$. On the other hand, we can easily show that $\mathrm{Bot}(E_2, B_2) =
\{p, r\}$. This shows that $\mathrm{Bot}(E_2, B_2) \neq \mathrm{Vbot}(E_2, B_2)$.
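
The bottom set of Example 2 is small enough to confirm by exhaustive model checking. The sketch below is only an illustration: it assumes the propositional reading of Definition 7 (a literal $L$ belongs to the bottom set when $L$ is false in every model of $B$ that falsifies $E$), and the encoding of literals as `(atom, polarity)` pairs and the name `bottom_set` are our own.

```python
from itertools import product

ATOMS = ("p", "q", "r")

# A clause is a set of literals; a literal is (atom, polarity).
B2 = [
    {("p", True), ("q", True), ("r", False)},     # p, q <- r
    {("p", True), ("q", False), ("r", False)},    # p <- q, r
    {("p", False), ("q", True), ("r", False)},    # q <- p, r
    {("p", False), ("q", False), ("r", False)},   # <- p, q, r
]
E2 = {("p", True)}                                # p <-

def holds(clause, model):
    """A clause is true in a model if at least one of its literals is true."""
    return any(model[atom] == polarity for atom, polarity in clause)

def bottom_set(background, example):
    """Literals L such that (B and not E) entails (not L), by checking all models."""
    models = [dict(zip(ATOMS, values))
              for values in product([False, True], repeat=len(ATOMS))]
    relevant = [m for m in models
                if all(holds(c, m) for c in background) and not holds(example, m)]
    return {(atom, polarity)
            for atom in ATOMS for polarity in (True, False)
            if all(m[atom] != polarity for m in relevant)}
```

The four clauses of $B_2$ together force $r$ to be false, which is why $r$ enters the bottom set although it does not occur in $E_2$.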

5 Concluding Remarks

Jung defined the set $V^{n+1}(E, B)$ as follows:

$$V^{n+1}(E, B) = \{H \in \mathrm{C}(\mathcal{L}^s) \mid H \text{ is V-derivable from a clause } F \in V^n(E, B) \text{ with the support of an instance } D \text{ of a clause in } B\} \quad (n \ge 0) .$$

In his definition, hypotheses in $V^{n+1}(E, B)$ are not always ground even if $V^n(E, B)$
contains ground clauses only. To keep all the clauses in $V^n(E, B)$ ground, he
assumed that the background theory $B$ should be a strongly generative logic
program. We need not assume strong generativeness because we restricted
the supporting clause $D$ to be a ground clause.

In our personal communication with Muggleton, he conjectured
that $H \in \mathrm{INVE}(E, B)$ is equivalent to $H \succeq_P E\ (B)$. The conjecture is equivalent to
$\mathrm{INVE}(E, B) = \mathrm{BOTG}(E, B)$, and has been refuted by our result.

We are now at the point of applying our results to the analysis of computational aspects
of learning. For example, it is in our future plans to investigate how to use Bottom
Generalization in extending the algorithms based on Saturant Generalization
presented at the COLT and ALT workshops.

Abstract. One of the most important issues in machine translation
is deducing unknown rules from pairs of input-output sentences. Since
translations are expressed by elementary formal systems (EFS’s, for
short), we formalize learning translations as the process of guessing an
unknown EFS from pairs of input-output sentences. In this paper, we
propose a class of EFS’s called linearly-moded EFS’s by introducing local
variables and linear predicate inequalities based on mode information,
which can express translations of context-sensitive languages. We show
that, for a given input sentence, the set of all output sentences is finite
and computable in a translation defined by a linearly-moded EFS. Finally,
we show that the class of translations defined by linearly-moded
EFS’s is learnable under the condition that the number of clauses in an
EFS and the length of each clause are bounded by some constant.



1 Introduction 



In machine translation, many formal systems have been developed to translate
one language into another. One well-known system, the syntax-directed translation
scheme (SDTS, for short), has sufficient power to express the relations between
two context-free languages. The SDTS has been investigated from the viewpoint
of designing compilers for programming languages. On the other hand, the expressive
power of SDTS’s is insufficient to deal with more complicated languages,
such as context-sensitive or natural languages.

One of the most important issues in machine translation is deducing unknown
rules from pairs of input-output sentences. This issue can be formalized
in the framework of learning binary relations from strings. In this paper, we adopt
an elementary formal system (EFS, for short) instead of an SDTS, and
discuss the learnability of translations over context-sensitive languages. The EFS
is flexible enough to define various classes of formal languages corresponding to
the Chomsky hierarchy, and is an adequate tool for learning formal languages.
An EFS is a kind of logic program on strings, and consists of clauses of
the form $A \leftarrow B_1, \ldots, B_n$, as in Prolog programs. Here, the sequence $B_1, \ldots, B_n$
is called the body of the clause. In EFS’s, we deal with strings as the argument
terms of a predicate symbol. Specifically, the binary relation of a translation
can be expressed by a binary predicate symbol in an EFS. Therefore, we can
formalize learning translations as the process of guessing an unknown EFS,
which defines a translation, from pairs of input-output sentences.

In general, a variable is said to be local if it occurs only in the body of a
clause. An EFS is generally defined as a finite set of clauses without local
variables. While this definition is sufficient for us to regard EFS’s as acceptors of
formal languages, it is not useful in terms of translations. For example, suppose
that two translations are defined by EFS’s with binary predicate symbols $q_1$
and $q_2$, and consider composing them. If we deal with standard EFS’s, we need
to generate a new EFS representing the composition. On the other hand, if we
deal with EFS’s with local variables, we can define the composition as a single
clause $q(x, z) \leftarrow q_1(x, y), q_2(y, z)$. Clearly, unconditionally introducing local variables
does not preserve the computability of computations for EFS’s. Hence,
in this paper, we extend the form of EFS’s by introducing local variables while
preserving the computability of computations from input sentences to output
sentences.

It is not surprising that there are no so-called negative examples in learning
translations from examples, because it is difficult to obtain meaningful negative
examples in any translation. Therefore, we discuss learning translations
from only positive examples. Learning in this setting has been applied to various
targets. In particular, Arimura and Shinohara have shown that the
class of linearly covering programs, which is a useful subclass of logic programs
with local variables, is learnable from positive examples. A linearly covering
program is defined by the information of an input-output mode. Furthermore,
Rao has extended the class by using linear predicate inequalities, which makes it
large enough to express standard Prolog programs such as quick-sort or merge-sort.

In this paper, we present a class of EFS’s, called linearly-moded EFS’s. The
linearly-moded EFS’s are EFS’s with local variables and linear predicate inequalities.
With linearly-moded EFS’s, we can define translations over context-sensitive
languages. Then, we show that, given an input sentence, the set of
all output sentences is finite and computable in the translation defined by a
linearly-moded EFS. Furthermore, we show that the class of translations defined
by linearly-moded EFS’s is learnable from positive examples under the
condition that the number of clauses in an EFS and the length of a clause are
bounded by some constant.

2 Preliminaries 

In this section, we give some basic definitions. Let $\Sigma$, $X$ and $\Pi$ be mutually
disjoint sets. We assume that $\Sigma$ is finite. We refer to each element of $\Sigma$ as a
constant symbol, to each element of $X$ as a variable, and to each element of $\Pi$
as a predicate symbol. Each predicate symbol is associated with a non-negative
integer called its arity. For a set $A$, we denote the set of all finite strings of
symbols from $A$ by $A^*$ and the set $A^* - \{\varepsilon\}$ by $A^+$, where $\varepsilon$ is the empty string.
A term is an element of $(\Sigma \cup X)^+$. A term is said to be ground if it is an element
of $\Sigma^+$. An atomic formula (atom, for short) is of the form $p(s_1, s_2, \ldots, s_n)$,
where $p$ is a predicate symbol with arity $n$ and each $s_i$ is a term ($1 \le i \le n$).
An atom $p(s_1, s_2, \ldots, s_n)$ is said to be ground if all of $s_1, s_2, \ldots, s_n$ are ground. In
this paper, constant symbols are denoted by $a, b, \ldots$, variables by $x, y, \ldots$, terms
or sequences of terms by $s, t, \ldots$, and atoms by $A, B, \ldots$.

A definite clause (clause, for short) is of the form $A \leftarrow B_1, \ldots, B_n$ ($n \ge 0$),
where $A, B_1, \ldots, B_n$ are atoms. The atom $A$ and the sequence $B_1, \ldots, B_n$ of
atoms are called the head and the body of the clause, respectively. We refer to
either a term, a finite sequence of terms, an atom or a clause as an expression.
For an expression $E$, the set of all variables occurring in $E$ is denoted by $v(E)$.
For an expression $E$ and a variable $x$, the number of occurrences of $x$ in $E$ is
denoted by $oc(x, E)$. An elementary formal system (EFS, for short) is a
finite set of clauses and is denoted by $\Gamma$.

For an EFS $\Gamma$, the set of all predicate symbols occurring in $\Gamma$ is denoted by
$\Pi_\Gamma$. A substitution is a finite set of the form $\{x_1/s_1, \ldots, x_n/s_n\}$, where $x_1, \ldots, x_n$
are distinct variables and each $s_i$ is a term distinct from $x_i$ ($1 \le i \le n$). Let
$E$ be an expression. For a substitution $\theta = \{x_1/s_1, \ldots, x_n/s_n\}$, $E\theta$, called an
instance of $E$, is the expression obtained from $E$ by simultaneously replacing
each occurrence of the variable $x_i$ in $E$ with the term $s_i$ ($1 \le i \le n$).

We give two semantics of EFS’s, provability and fixpoint semantics. 

First, we introduce the provability semantics. Let $\Gamma$ and $C$ be an EFS and a
clause. Then, the provability relation $\Gamma \vdash C$ is defined inductively as follows:

1. If $C \in \Gamma$ then $\Gamma \vdash C$.

2. If $\Gamma \vdash C$ then $\Gamma \vdash C\theta$ for any substitution $\theta$.

3. If $\Gamma \vdash A \leftarrow B_1, \ldots, B_m$ and $\Gamma \vdash B_m \leftarrow$ then $\Gamma \vdash A \leftarrow B_1, \ldots, B_{m-1}$.

A clause $C$ is provable from $\Gamma$ if $\Gamma \vdash C$. The provability semantics of an EFS $\Gamma$
is the set:

$$PS(\Gamma) = \{A \mid A \text{ is ground and } \Gamma \vdash A \leftarrow\} .$$

The second semantics is based on the least fixpoint of the function $T_\Gamma$. Let $\Gamma$
and $S$ be an EFS and a set of ground atoms. Then, the function $T_\Gamma$ is defined
as follows:

$$T_\Gamma(S) = \{A \mid \text{there exists a ground instance } A \leftarrow B_1, \ldots, B_n \text{ of a clause in } \Gamma \text{ such that } B_i \in S \text{ for any } i\ (1 \le i \le n)\} .$$

Clearly, the function $T_\Gamma$ is monotonic for any EFS $\Gamma$. The fixpoint semantics
$T_\Gamma{\uparrow}\omega$ of an EFS $\Gamma$ is defined as follows:

1. $T_\Gamma{\uparrow}0 = \emptyset$,

2. $T_\Gamma{\uparrow}n = T_\Gamma(T_\Gamma{\uparrow}(n-1))$ for $n \ge 1$,

3. $T_\Gamma{\uparrow}\omega = \bigcup_{n \ge 0} T_\Gamma{\uparrow}n$.

We can show that the two semantics are equivalent for every EFS.
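
The iteration $T_\Gamma{\uparrow}n$ can be illustrated on a finite set of ground clauses. The sketch below is a simplification under our own assumptions: the program is given directly as ground instances, encoded as `(head, body)` pairs of strings, rather than as an EFS with variables, and the names `t_operator`, `least_fixpoint` and `GROUND_PROGRAM` are hypothetical.

```python
def t_operator(program, atoms):
    """One application of T: heads of ground clauses whose body atoms all lie in `atoms`."""
    return {head for head, body in program if all(b in atoms for b in body)}

def least_fixpoint(program):
    """Iterate T from the empty set until nothing changes (finite ground programs only)."""
    current = set()
    while True:
        following = t_operator(program, current)
        if following == current:
            return current
        current = following

# A few ground clauses over a binary predicate q2, written as strings.
GROUND_PROGRAM = [
    ("q2(a;a)", []),
    ("q2(b;b)", []),
    ("q2(ab;ba)", ["q2(b;b)"]),   # derivable once q2(b;b) is known
    ("q2(ba;bb)", ["q2(a;b)"]),   # never derivable: q2(a;b) is not provable
]
```

Because $T_\Gamma$ is monotonic and the iteration starts from the empty set, the successive sets form an increasing chain, so the loop stops exactly at the least fixpoint of this finite program.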

3 Linearly-Moded EFS’s

Rao has proposed a class of Prolog programs based on moding information
and linear predicate inequalities. There, the inequality has been defined as an
inclusion over the multisets corresponding to terms. In this section, we redefine
the inequality as a relation over the length of each term. Then, we introduce the
class of EFS’s with linear predicate inequalities. An EFS in this class may include
local variables, which occur only in the body of a clause, under some
conditions.

First, we define a partial order over finite sequences of terms. Note that 
we deal with not only terms but also finite sequences of terms in the following 
definition. 

Let $s$ and $t$ be finite sequences of terms. Then, we write $s \succeq t$ if $|s\theta| \ge |t\theta|$
for any substitution $\theta$. We can easily show that $s \succeq t$ if and only if $|s| \ge |t|$ and
$oc(x, s) \ge oc(x, t)$ for any $x \in v(t)$. Hence, the following lemma obviously holds.

Lemma 1. For any pair $(s, t)$ of finite sequences of terms, the problem of deciding
whether $s \succeq t$ or not is solvable.
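
The decision procedure behind Lemma 1 follows directly from the characterization above. The sketch below rests on assumptions of ours: a sequence of terms is flattened into one Python string, the letters `x`, `y`, `z` play the role of variables and every other character is a constant symbol, and the name `dominates` is hypothetical.

```python
from collections import Counter

VARIABLES = set("xyz")   # assumption: x, y, z are variables, all other symbols constants

def dominates(s, t):
    """Decide whether |s.theta| >= |t.theta| holds for every substitution theta."""
    occ_s, occ_t = Counter(s), Counter(t)
    longer = len(s) >= len(t)
    enough_variables = all(occ_s[x] >= occ_t[x] for x in VARIABLES if x in occ_t)
    return longer and enough_variables
```

For instance, `dominates("xx", "yz")` is false although both strings have length 2, because $y$ can be instantiated to an arbitrarily long term while $x$ is kept short.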

For an $n$-ary predicate symbol $p$, a mode $F_p$ of $p$ is a function from $\{1, \ldots, n\}$ to
$\{in, out\}$. The $i$th argument of $p$ is called an input (resp., output) argument if
$F_p(i) = in$ (resp., $out$). In order to simplify notation, we assume that, for any $i$
and $j$, if $F_p(i) = in$ and $F_p(j) = out$, then it holds that $i < j$. Then, we separate
input arguments in an atom from output ones by the special symbol ";", that
is, we denote an atom by $p(t_1, \ldots, t_m; t_{m+1}, \ldots, t_n)$, where $F_p(i) = in$ for any $i$
($1 \le i \le m$) and $F_p(j) = out$ for any $j$ ($m + 1 \le j \le n$). For a predicate symbol
$p$, the set of all $i$ such that $F_p(i) = in$ is denoted by $in(p)$. An EFS is said to be
a moded EFS if a mode is defined for every predicate symbol in the EFS.

Let $\Gamma$ be a moded EFS and $q_0$ be a binary predicate symbol in $\Pi_\Gamma$ such that
$F_{q_0}(1) = in$ and $F_{q_0}(2) = out$. Then, a translation defined by $\Gamma$, denoted by
$Trans(\Gamma)$, is defined as follows:

$$Trans(\Gamma) = \{(s, t) \in \Sigma^+ \times \Sigma^+ \mid q_0(s; t) \in PS(\Gamma)\} .$$

Let $\Gamma$ be a moded EFS. Then, an input selection $I$ of $\Gamma$ is a function from $\Pi_\Gamma$
to sets of natural numbers such that $I(p) \subseteq in(p)$ for any predicate symbol $p$
in $\Pi_\Gamma$.

Definition 1. Let $\Gamma$ be a moded EFS and $I$ be an input selection of $\Gamma$. Then,
for an atom $A = p(s_1, \ldots, s_m; t)$, $LI(A, I)$ is defined as the set of all substitutions
$\theta$ such that $(s_{i_1}, \ldots, s_{i_k})\theta \succeq t\theta$, where $I(p) = \{i_1, \ldots, i_k\}$.

Note that, in the above definition, the set $LI(A, I)$ can be regarded as the
set of all answers to the inequality $s_{i_1}, \ldots, s_{i_k} \succeq t$ for the two sequences $s_{i_1}, \ldots, s_{i_k}$
and $t$ of terms. Then, the class of EFS’s with linear inequalities is defined as
follows.

Definition 2. Let $\Gamma$ be a moded EFS and $I$ be an input selection of $\Gamma$. Then, the
EFS $\Gamma$ is linearly-moded w.r.t. $I$ if each clause

$$p_0(s_0; t_0) \leftarrow p_1(s_1; t_1), \ldots, p_m(s_m; t_m)$$

in $\Gamma$ satisfies the following conditions:

1. if $\theta \in LI(p_i(s_i; t_i), I)$ for any $i$ ($1 \le i \le j - 1$), then $s_0\theta \succeq s_j\theta$ for each
$j$ ($1 \le j \le m$), and

2. if $\theta \in LI(p_i(s_i; t_i), I)$ for any $i$ ($1 \le i \le m$), then $\theta \in LI(p_0(s_0; t_0), I)$.

The EFS $\Gamma$ is linearly-moded if there exists a function $I$ such that $\Gamma$ is linearly-moded
w.r.t. $I$.

In a linearly-moded EFS, each clause may have local variables
under some conditions, which is useful for expressing binary relations by EFS’s.



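Example 1. Let $I(q_0) = I(q_1) = I(q_2) = \{1\}$ and $\Gamma$ be an EFS defined as follows:

$$\Gamma = \left\{ \begin{array}{l}
q_0(xx; yz) \leftarrow q_1(x; y), q_2(y; z), \\
q_1(axb; ayb) \leftarrow q_1(x; y), \\
q_1(ab; ab) \leftarrow, \\
q_2(ax; ya) \leftarrow q_2(x; y), \\
q_2(bx; yb) \leftarrow q_2(x; y), \\
q_2(a; a) \leftarrow, \\
q_2(b; b) \leftarrow
\end{array} \right\} .$$

In the first clause of $\Gamma$, (1) $xx \succeq x$, (2) $x \succeq y$ implies $xx \succeq y$, and (3) $x \succeq y$ and
$y \succeq z$ imply $xx \succeq yz$, where (1) and (2) correspond to condition 1, while (3) corresponds to
condition 2 in Definition 2. Since all clauses in $\Gamma$ satisfy the conditions of linearly-moded
EFS’s, $\Gamma$ is a linearly-moded EFS. The translation defined by $\Gamma$ is

$$\{(a^n b^n a^n b^n,\ a^n b^n b^n a^n) \mid n \ge 1\} .$$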
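
The translation of Example 1 can be traced with a direct simulation of its clauses. This is a hand-written sketch rather than a general EFS interpreter: each Python function below hard-codes what the clauses of one predicate derive on ground inputs (our reading is that $q_1$ is the identity on $a^n b^n$ and $q_2$ is string reversal), and the function names simply mirror the predicate symbols.

```python
def q1(x):
    """Outputs y with q1(x; y): the identity on strings a^n b^n (n >= 1)."""
    n = len(x) // 2
    return {x} if n >= 1 and x == "a" * n + "b" * n else set()

def q2(x):
    """Outputs y with q2(x; y): the reversal of a non-empty string over {a, b}."""
    return {x[::-1]} if x and set(x) <= {"a", "b"} else set()

def q0(w):
    """Outputs of the clause q0(xx; yz) <- q1(x; y), q2(y; z) on the ground input w."""
    half, remainder = divmod(len(w), 2)
    if remainder or w[:half] != w[half:]:
        return set()                      # the input is not of the form xx
    x = w[:half]
    return {y + z for y in q1(x) for z in q2(y)}
```

The local variable $y$ is what lets the output of $q_1$ feed the input of $q_2$, and every output has the same length as its input, in line with the length conditions of Definition 2.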



By Definition 2, the following lemma holds:

Lemma 2. If $\Gamma$ is a linearly-moded EFS, then each clause

$$p_0(s_0; t_0) \leftarrow p_1(s_1; t_1), \ldots, p_m(s_m; t_m)$$

in $\Gamma$ satisfies the following conditions:

1. if $m = 0$ then $v(t_0) \subseteq v(s_0)$, and

2. $v(s_i) \subseteq v(s_0) \cup v(t_1) \cup \cdots \cup v(t_{i-1})$ for any $i$ ($1 \le i \le m$).

Since the relation $s \succeq t$ is based on the length of each instance of $s$ and $t$, it can
be regarded as an inequality over the set of integers. Therefore, the following
proposition holds.

Proposition 1. For any moded EFS $\Gamma$, the problem of deciding whether $\Gamma$ is
linearly-moded or not is solvable.

Proof. For a term $s$, define the following expression $[s]$:

$$[s] = \sum_{x \in v(s)} oc(x, s)\,x + k,$$

where $k$ is the number of constant symbols occurring in $s$. Then, the problem of
deciding whether $|s| \ge |t|$ or not is reduced to the satisfiability problem of the linear
inequality $[s] \ge [t]$ over the set of integers. Hence, for a given input selection $I$,
the problem of deciding whether $\Gamma$ is linearly-moded w.r.t. $I$ or not is solvable.
Since the number of possible input selections $I$ is finite for a given moded EFS
$\Gamma$, the problem is solvable. □

4 Properties of Linearly-Moded EFS’s 

We can characterize linearly-moded EFS’s by the relationship between input and 
output sentences generated by them. In this section, we show some properties 
for the translation defined by a linearly-moded EFS. 

Lemma 3. For every linearly-moded EFS $\Gamma$, if a ground atom $A = p(s; t)$ is an
element of $PS(\Gamma)$, then it holds that $|s| \ge |t|$.

Proof. We show that if $p(s; t) \in T_\Gamma{\uparrow}i$ then $|s| \ge |t|$ by induction on $i$.

If $i = 1$ then there exists a clause $A' = p(s'; t') \leftarrow\ \in \Gamma$ such that $A = A'\theta$
for a substitution $\theta$. By Definition 2, $|s'\sigma| \ge |t'\sigma|$ for any substitution $\sigma$. Hence,
$|s| = |s'\theta| \ge |t'\theta| = |t|$.

Assume that $A \in T_\Gamma{\uparrow}k$. Then, there exists a clause $A' \leftarrow B_1, \ldots, B_m \in \Gamma$
such that $A = A'\theta$ and $B_i\theta \in T_\Gamma{\uparrow}(k-1)$. By the induction hypothesis, $|s_i\theta| \ge |t_i\theta|$
for any $p_i(s_i; t_i) = B_i$ ($1 \le i \le m$). Hence, it holds that $\theta \in LI(B_i, I)$ for any
$i$ ($1 \le i \le m$). Since $\theta \in LI(A', I)$ by Definition 2, it holds that $|s| = |s'\theta| \ge
|t'\theta| = |t|$. □

By Lemma 3, the class of translations defined by linearly-moded EFS’s is incomparable
with the class of translations defined by all SDTS’s, because an SDTS
can express a translation in which the length of an output sentence is greater
than the length of its input sentence.

By the following theorem, it is shown that the translation defined by a 
linearly-moded EFS is recursive. 

Theorem 1. Let $A$ be a ground atom and $\Gamma$ be a linearly-moded EFS. Then,
the problem of deciding whether $\Gamma \vdash A \leftarrow$ or not is solvable.

Proof. Suppose that $|A| = n$ and let $k$ be $|\Pi_\Gamma| \cdot \sum_{i=1}^{n} |\Sigma|^i$. If $A \in T_\Gamma{\uparrow}(k+1) - T_\Gamma{\uparrow}k$,
then, for any $i$ ($1 \le i \le k$), there exists an atom $B_i \in T_\Gamma{\uparrow}i - T_\Gamma{\uparrow}(i-1)$ such
that $|B_i| \le n$ by the definition of $T_\Gamma$ and Definition 2. Since the number of all
ground atoms whose length is at most $n$ is at most $k$, there exist two atoms $C$
and $D$ in the sequence $A, B_k, \ldots, B_1$ such that $C = D$. This fact contradicts
the monotonicity of $T_\Gamma$. Hence, if $A \in T_\Gamma{\uparrow}\omega$ then $A \in T_\Gamma{\uparrow}k$.

For a set $S$, we denote the set $\{A \in S \mid |A| \le j\}$ by $S/j$. Then, we can easily
show that $T_\Gamma{\uparrow}(i+1)/j = T_\Gamma(T_\Gamma{\uparrow}i/j)/j$ for any $i \ge 0$ and $j \ge 0$. Since the set
$T_\Gamma{\uparrow}i/j$ is finite and computable for any $i \ge 0$ and $j \ge 0$, and $\Gamma \vdash A \leftarrow$ if and
only if $A \in T_\Gamma{\uparrow}k/n$, the problem is solvable. □

The following two propositions show that all of the output sentences can be
computed from input sentences.

Proposition 2. Let $\Gamma$ be a linearly-moded EFS, and $p(s; t)$ be an atom such
that $\Gamma \vdash p(s; t) \leftarrow$. If $s$ is ground, then so is $t$.

Proof. We say that a clause $C$ is $k$-provable from $\Gamma$ if $C$ is provable from $\Gamma$ using
at most $k$ applications of the rules in the definition of the relation $\Gamma \vdash C$. We
prove the statement by induction on $k$.

Suppose that $p(s; t) \leftarrow$ is 1-provable from $\Gamma$. Then, $p(s; t) \leftarrow\ \in \Gamma$. By Definition 2,
it holds that $v(t) \subseteq v(s)$. Since $s$ is ground, $t$ is ground.

Suppose that $p(s; t) \leftarrow$ is $k$-provable ($k \ge 2$) from $\Gamma$. Then, there exist a
clause $p(s_0; t_0) \leftarrow q_1(s_1; t_1), \ldots, q_m(s_m; t_m) \in \Gamma$ and a substitution $\theta$ such that
$p(s_0; t_0)\theta = p(s; t)$ and all of the atoms $q_1(s_1; t_1)\theta, \ldots, q_m(s_m; t_m)\theta$ are $(k-1)$-provable
from $\Gamma$. For any $i$ ($m \ge i \ge 1$), if all of the atoms $q_1(s_1; t_1)\theta, \ldots,
q_{i-1}(s_{i-1}; t_{i-1})\theta$ are ground, so is $q_i(s_i; t_i)\theta$ by Lemma 2 and the induction
hypothesis. Then, all of the atoms $q_1(s_1; t_1)\theta, \ldots, q_m(s_m; t_m)\theta$ are ground. Since
$v(t_0) \subseteq v(s_0) \cup v(t_1) \cup \cdots \cup v(t_m)$, $t_0\theta$ is ground. □

Proposition 3. Let $\Gamma$ be a linearly-moded EFS. Then, for any ground term $s$,
the set $\{t \in \Sigma^+ \mid (s, t) \in Trans(\Gamma)\}$ is finite and computable.

Proof. We can easily show that $(s, t) \in Trans(\Gamma)$ if and only if $q_0(s; t) \in
T_\Gamma{\uparrow}\omega$. Furthermore, if $q_0(s; t) \in T_\Gamma{\uparrow}\omega$ then $q_0(s; t) \in T_\Gamma{\uparrow}k/2n$, where $k =
|\Pi_\Gamma| \cdot \sum_{i=1}^{2n} |\Sigma|^i$. Since the set $T_\Gamma{\uparrow}k/2n$ is finite and computable, the set
$\{t \in \Sigma^+ \mid (s, t) \in Trans(\Gamma)\}$ is also finite and computable. □

The expressive power of the linearly-moded EFS’s is characterized by Lemma 3
and the following proposition.

Proposition 4. A linearly-moded EFS can define binary relations over context-sensitive
languages.

Proof. Consider the class of linearly-moded EFS’s in which each clause is of
the form $p_0(s_0;\ ) \leftarrow p_1(s_1;\ ), \ldots, p_m(s_m;\ )$. The class contains all of the length-bounded
EFS’s, which can define context-sensitive languages. □

5 Learnability of Linearly-Moded EFS’s from Positive 
Examples 

In this section, we discuss the learnability of two classes of EFS’s from positive 
examples. One is the linearly-moded EFS’s, and the other is the slightly larger 
class of linearly-moded EFS’s. 

Let $\Gamma_1, \Gamma_2, \ldots$ be any recursive enumeration of linearly-moded EFS’s. Then,
the class $C = Trans(\Gamma_1), Trans(\Gamma_2), \ldots$ is an indexed family of recursive sets.
A translation is a subset of $\Sigma^+ \times \Sigma^+$. A semantic mapping is a mapping from
EFS’s to translations. A semantic mapping $M$ is monotonic if $\Gamma' \subseteq \Gamma$ implies
$M(\Gamma') \subseteq M(\Gamma)$. An EFS $\Gamma$ is reduced w.r.t. a set $S$ of atoms if $S \subseteq M(\Gamma)$ but
$S \not\subseteq M(\Gamma')$ for any $\Gamma' \subsetneq \Gamma$. A concept defining framework is a triple $(U, E, M)$
of a universe $U$ of objects, a universe $E$ of expressions, and a semantic mapping
$M$.

Definition 3. A concept defining framework $(U, E, M)$ has bounded finite thickness
if $M$ is monotonic, and for any finite set $S \subseteq U$ and any $n$ ($n \ge 0$), the set

$$\{M(\Gamma) \mid \Gamma \text{ is reduced w.r.t. } S \text{ and } |\Gamma| \le n\}$$

is finite.

Shinohara has shown that if a concept defining framework $C = (U, E, M)$
has bounded finite thickness, then the class

$$C^k = \{M(\Gamma) \mid \Gamma \subseteq E \text{ and } |\Gamma| \le k\}$$

is learnable from positive examples.

We denote the set of all linearly-moded EFS’s in which each clause has at
most $m$ atoms in its body by $\mathcal{LM}^m$. Consider the concept defining framework
$(\Sigma^+ \times \Sigma^+, \mathcal{LM}^m, Trans)$. Then the following theorem holds.

Theorem 2. For any $k > 0$, the class

$$\mathcal{LM}^m_k = \{Trans(\Gamma) \mid \Gamma \subseteq \mathcal{LM}^m \text{ and } |\Gamma| \le k\}$$

is learnable from positive examples.

Proof. We show that the concept defining framework $(\Sigma^+ \times \Sigma^+, \mathcal{LM}^m, Trans)$ has
bounded finite thickness for any $m \ge 1$.

Since the function $PS$ is monotonic, so is the function $Trans$. Let $n$ be a
positive integer, $S$ be a finite subset of $\Sigma^+ \times \Sigma^+$, and $l$ be the maximum length
of $s$ such that $(s, t) \in S$. If a linearly-moded EFS $\Gamma$ is reduced w.r.t. $S$ and
$|\Gamma| \le n$, then each clause $p_0(s_0; t_0) \leftarrow p_1(s_1; t_1), \ldots, p_i(s_i; t_i) \in \Gamma$ satisfies the
condition that $|s_j| \le l$ and $|t_j| \le l$ for any $j$ ($0 \le j \le i$). Since $i \le m$ and the
number of all predicate symbols in $\Gamma$ is at most $n$, the set

$$\{Trans(\Gamma) \mid \Gamma \text{ is reduced w.r.t. } S \text{ and } |\Gamma| \le n\}$$

is finite. By Shinohara’s theorem, the class $\mathcal{LM}^m_k$ is learnable from positive
examples. □

Consider the learnability of the larger class of linearly-moded EFS’s. Let $s$
and $t$ be finite sequences of terms, and $l$ be a positive integer. Then, we write
$s \ge^{l} t$ if $|s\theta| + l \ge |t\theta|$ for any substitution $\theta$. For example, $x \ge^{1} ax$ and $ab \ge^{1} aab$.
An $l$-linearly-moded EFS is defined by replacing $\ge$ with $\ge^{l}$ in Definition [?] and
Definition [?]. For the translation defined by an $l$-linearly-moded EFS, the length
of each output sentence is at most $l$ greater than the length of its input sentence.
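
The relation $\ge^{l}$ quantifies over all substitutions, but it can be checked syntactically. The following is a minimal sketch, not taken from the paper, under two assumptions: terms are written as plain strings in which `x`, `y`, `z` are variables and every other symbol is a constant, and substitutions are non-erasing (each variable is replaced by a non-empty string). Under those assumptions the condition holds exactly when every variable occurs in `s` at least as often as in `t` and `len(s) + l >= len(t)`. The helper name `geq_l` is hypothetical.

```python
from collections import Counter

# Assumption for this sketch: x, y, z are variables, all other symbols are constants.
VARIABLES = set("xyz")

def geq_l(s: str, t: str, l: int) -> bool:
    """Check |s.theta| + l >= |t.theta| for every non-erasing substitution theta."""
    vars_s = Counter(ch for ch in s if ch in VARIABLES)
    vars_t = Counter(ch for ch in t if ch in VARIABLES)
    # A variable that occurs more often in t than in s can be made arbitrarily
    # long, which eventually breaks the inequality.
    if any(vars_t[v] > vars_s[v] for v in vars_t):
        return False
    # Otherwise the worst case is the substitution of length-1 strings.
    return len(s) + l >= len(t)

print(geq_l("x", "ax", 1), geq_l("ab", "aab", 1), geq_l("x", "ax", 0))
```

The first two calls correspond to the examples in the text; the third shows that the head of `suc(x; ax)` fails the condition for $l = 0$.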

Let $\mathcal{E}(l)^m$ be the class of all $l$-linearly-moded EFS’s in which each clause has
at most $m$ atoms in its body. Then, the following theorem holds.

Theorem 3. Suppose that $m \ge 2$ and $k \ge 3$. Then, the class

$\mathcal{E}(l)^m_k = \{Trans(\Gamma) \mid \Gamma \subseteq \mathcal{E}(l)^m \text{ and } |\Gamma| \le k\}$

is not learnable from positive examples for any $l \ge 1$.

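Proof. The class $\mathcal{E}(l)^2_3$ contains the following EFS’s $\Gamma_n$ ($n \ge 1$) and $\Gamma_\infty$:

$$\Gamma_n = \left\{ \begin{array}{l} q_0(a^n; a^n) \leftarrow, \\ q_0(x; x) \leftarrow suc(x; y),\ q_0(y; z), \\ suc(x; ax) \leftarrow \end{array} \right\}, \qquad \Gamma_\infty = \left\{ \begin{array}{l} q_0(a; a) \leftarrow, \\ q_0(ax; ay) \leftarrow q_0(x; y) \end{array} \right\}$$

Note that only the third clause of $\Gamma_n$ does not satisfy the conditions in
Definition 2, because $|x\theta| + 1 = |ax\theta|$ for any substitution $\theta$. Then, $Trans(\Gamma_n) =
\{(a^i, a^i) \mid 1 \le i \le n\}$ and $Trans(\Gamma_\infty) = \{(a^i, a^i) \mid i \ge 1\}$. Since $Trans(\Gamma_i) \subset
Trans(\Gamma_{i+1})$ and $Trans(\Gamma_i) \subset Trans(\Gamma_\infty)$ for any $i \ge 1$, the class $\mathcal{E}(l)^2_3$ is
super-finite. Hence, it is not learnable from positive examples [?]. □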

6 Conclusion 

In this paper, we have introduced a linearly-moded EFS with local variables and
linear predicate inequalities based on mode information. The class of
linearly-moded EFS’s is large enough to express translations over context-sensitive
languages. For the translation defined by a linearly-moded EFS, the set of all output
sentences is finite and computable for each input sentence. Furthermore, we have
shown that the class of translations defined by linearly-moded EFS’s is learnable
from positive examples under the condition that the number of clauses in the
EFS’s and the length of each clause are bounded by some constant.

As a further extension of EFS’s, we have formalized an $l$-linearly-moded
EFS. For the translations defined by an $l$-linearly-moded EFS, the length of
each output sentence is at most $l$ greater than the length of its input sentence.
It is a natural extension, because a 0-linearly-moded EFS is equivalent to
a linearly-moded EFS. However, the class of translations defined by
$l$-linearly-moded EFS’s is not learnable from positive examples for any $l \ge 1$, even if both
the number of clauses in each EFS and the length of the body of each clause are
bounded by some constant. Therefore, we need another investigation to extend the
learnability of translations without the restriction on the relationship between
the lengths of input and output sentences.

Abstract. Classification is one of the major tasks in case-based reasoning
(CBR), and many studies have analyzed properties of case-based
classification [?]. However, these studies only consider numerical similarity
measures, whereas there are other kinds of similarity measures for different
tasks. Among these measures, the HYPO system in a legal domain uses a
similarity measure based on set inclusion of differences of attributes in cases.

In this paper, we give an analysis of representability of boolean functions
in case-based classification using the above set inclusion based similarity.
We show that such case-based classification has a strong connection with
monotone theory studied in [?]. Monotone theory originated
from computational learning theory and is used to show learnability of
boolean functions with polynomial DNF size and polynomial CNF size [3]
and is used for deductive reasoning as well [?]. In this paper, we analyze
case-based representability of boolean functions by using the above
relationship between case-based classification by set inclusion based
similarity and the monotone theory. We show that any boolean function
is representable by a casebase whose size is bounded by a polynomial in its
DNF size and its CNF size, and thus $k$-term DNF and $k$-clause CNF are
efficiently representable in a casebase using set inclusion similarity.



1 Introduction 

In this paper, we show a correspondence between case-based classification using
set inclusion similarity and monotone theory [?] in learning of boolean
functions, and analyze case-based representability of boolean functions by using the
correspondence.

Classification is one of the main tasks of case-based reasoning (CBR). For
example, CBR has been used for classification of pronunciation of unknown English
words [?], a telex classification [?], and the census classification task [?].

For case-based classification, we introduce a distance measure between cases,
retrieve the case most similar to the current case, and regard the
classification label of the retrieved case as the label of the current case. This kind of
usage of CBR is sometimes called memory-based reasoning [?], instance-based
learning [?], or, at its origin, nearest-neighbor classification [?].

This way of learning contrasts with the usual inductive learning mechanism,
where we extract an abstract regularity which reflects the tendency of examples.
Although it is more desirable to induce abstract information from examples, if we
focus on a classification task only, we do not need such information. Actually, [?]
shows that the performance of instance-based learning for several classification
tasks is comparable with other inductive learning mechanisms such as Quinlan’s
C4. Moreover, instance-based learning has the advantage of a simple and lazy
manner of learning.

To understand the behavior of case-based classifiers, many theoretical
analyses have been done [?]. Among these studies, [?] give
analyses of representability of concepts in a case-based manner. [?] analyzes
case-based representability of pattern languages, and [?] gives upper and lower
bounds on sample complexity to present various concept classes in the nearest
neighbor algorithm and investigates efficient case-based representations for
some classes of boolean functions. A motivation of these analyses of case-based
classifiers is explained in [?]: we would like to know how many cases are needed
for the system to “learn” the concept exactly. Another motivation is that we
should analyze representability before analyzing learnability, since if the concept
cannot be represented efficiently, the concept cannot be learned efficiently [?].
Although the results of the above studies are important in their own right, the
case-based classifiers considered in these studies are based on numerical similarity
measures.

On the other hand, there are other similarity measures used in different tasks
of CBR. Among existing CBR systems, a legal CBR system, HYPO [?], uses
a similarity measure based on set inclusion of differences of attributes in cases.
The original usage of HYPO is to retrieve similar or contrasting precedents for
an input case to create an argument for the input case. However, we can use
this similarity measure for classification as follows. The current case is classified
as positive if there is a positive case which shares a set of factors with the current
case and there is no negative case such that the set of shared factors between
the negative case and the current case includes the set of shared factors between
the positive case and the current case. This idea is actually implemented in
abductive logic programming [?] to decide whether an input case is preferable
to the plaintiff side or the defendant side.

In this paper, we show that case-based reasoning with the set inclusion based
similarity is closely related to monotone theory studied in [?]. Monotone
theory originated from computational learning theory and is used to show
learnability of boolean functions with polynomial DNF size and polynomial CNF size.
Moreover, [?] shows that the idea is also applicable to deductive reasoning tasks.
We use the monotone theory for a different purpose, that is, an analysis of
representability of case-based classifiers using set inclusion similarity. Specifically, we
show that a boolean function defined by a casebase with our similarity measure
is the complement of a monotone extension [?] such that the set of positive cases in
the casebase is what is called a basis in [?] and the negative cases are assignments in the
monotone extension. By using this relationship, we show that any boolean function $f$
is representable by a casebase whose size is bounded by a polynomial in its DNF
size and its CNF size. This implies that interesting classes such as $k$-term
DNF and $k$-clause CNF are efficiently representable by a case-based classifier using
the set inclusion similarity measure.

The structure of the paper is as follows. In Section 2, we give preliminary
definitions. In Section 3, we propose a new similarity measure based on set
inclusion for case-based classification and give some properties of the similarity.
In Section 4, we show a correspondence between set inclusion based case-based
classification and monotone theory. In Section 5, we discuss representability of
boolean functions by using the above correspondence. Finally, we conclude our
paper by summarizing our contribution and future research.



2 Preliminaries 

We consider a boolean function from $\{0,1\}^n$ to $\{0,1\}$. To represent a boolean function
syntactically, we use a boolean formula in the usual way, which consists of a tuple
of boolean variables $(X_1, \ldots, X_n)$ and logical connectives such as $\wedge$, $\vee$ and $\neg$, which
denote the boolean AND, OR and NOT operators respectively. We denote the negation
of a formula $F$ as $\overline{F}$, called the complement of $F$. A literal is either a variable $X_i$
(called a positive literal) or its negation $\neg X_i$ (called a negative literal). A clause
is a disjunction of literals. We say that a variable appears positively in a clause
if a positive literal of the variable appears in the clause, and a variable appears
negatively in a clause if a negative literal of the variable appears in the clause.
A CNF formula is a conjunction of clauses and a DNF formula is a disjunction
of conjunctions of literals. Note that there are many CNF representations of
the same boolean function and many DNF representations of the same boolean
function. We denote the DNF size of a boolean function $f$ as $|DNF(f)|$, meaning
the minimum possible number of conjunctions in any DNF representation of $f$,
and the CNF size of a boolean function $f$ as $|CNF(f)|$, meaning the minimum
possible number of clauses in any CNF representation of $f$.

We use a bit vector $x \in \{0,1\}^n$ to give a value of a boolean function
represented by a formula. We assign the value of the $i$-th component of $x$, $x[i]$, to
the variable $X_i$ and interpret a formula in the usual way. We say that an
assignment $x$ satisfies a boolean function $f$ if $f(x) = 1$. We sometimes regard a boolean
function $f$ as the set of assignments satisfying $f$, that is, for an assignment $x$, we
write $x \in f$ instead of $f(x) = 1$. We also regard a set of assignments $S$ as a
boolean function, that is, for an assignment $x$, we write $S(x) = 1$ if and only if
$x \in S$. We use an interpretation function $\phi$ from a formula to the boolean
function represented by the formula. If $F$ is a representation of a boolean function
$f$, then $\phi(F) = f$. For a boolean function $f$, $\overline{f}$ expresses the complement of $f$,
which is defined by $\overline{f}(x) = 1$ if and only if $f(x) = 0$ for every assignment $x$. Note
that for a set representation of a boolean function, $\overline{f} = \{0,1\}^n - f$.

3 Case-Based Classification

We regard an assignment vector in $\{0,1\}^n$ as a case. This means that a case is
represented by $n$ boolean-valued attributes.

Definition 1. For cases $c_1, c_2$, we define $d(c_1, c_2)$ as $c_1 \oplus c_2$ where $\oplus$ is a bitwise
EXCLUSIVE-OR operation ($(c_1 \oplus c_2)[i] = 1$ if and only if $c_1[i] \neq c_2[i]$).

We write $d(c_1, c) \preceq d(c_2, c)$ where $\preceq$ denotes a partial order over vectors
$x, y \in \{0,1\}^n$ such that $x \preceq y$ if and only if $\forall i\ (1 \le i \le n),\ x[i] \le y[i]$.

We write $d(c_1, c) \prec d(c_2, c)$ if $d(c_1, c) \preceq d(c_2, c)$ and $d(c_1, c) \neq d(c_2, c)$.

In the above definition, $d$ expresses the difference set of a pair of cases, and $\preceq$ is
based on the set-inclusion relation, expressing that the difference set of one pair is
included in the difference set of the other pair.

Definition 2. Let $CB$ be a pair of two disjoint sets of cases $(CB^+, CB^-)$.

We call $CB$ a casebase, $CB^+$ a set of positive cases and $CB^-$ a set of negative
cases respectively.

Let $CB$ be a casebase $(CB^+, CB^-)$. We say that a boolean function $f_{CB}$ is
represented by a casebase $CB$ if

$f_{CB} = \{c \mid \exists c_{ok} \in CB^+ \text{ s.t. } \forall c_{ng} \in CB^-,\ d(c_{ng}, c) \not\preceq d(c_{ok}, c)\}$

Note that in the above definition $f_{CB}$ is represented as a set of assignments.
For a function representation, $f_{CB}(c) = 1$ if and only if $c \in f_{CB}$. We can recognize
the analogue to instance-based learning by numerical-valued similarity. However,
note that $d(c_{ng}, c) \not\preceq d(c_{ok}, c)$ does not always imply $d(c_{ok}, c) \prec d(c_{ng}, c)$.

Example 1. Let $\{X_1, X_2, X_3, X_4\}$ be a set of variables. We consider a boolean
function $f : \{0,1\}^4 \mapsto \{0,1\}$ such that a representation of $f$ in a CNF formula
is

$(\neg X_1 \vee \neg X_2 \vee \neg X_3) \wedge (X_2 \vee \neg X_3 \vee \neg X_4)$.

Then, the assignments satisfying $f$ are:

{0000, 0001, 0010, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1100, 1101},

where an assignment is represented as a sequence $b_1 b_2 b_3 b_4$ meaning that $b_i$ is
the value assigned to $X_i$.

Let $CB$ be a casebase $(CB^+, CB^-)$ where $CB^+ = \{0010, 0101\}$ and $CB^- =
\{0011, 1110, 1111\}$. We can show that $f$ is represented by $CB$, that is, $f_{CB} = f$, by
checking that for every $c \in f$ there exists some $c_{ok} \in CB^+$ such that for every $c_{ng} \in CB^-$,
$d(c, c_{ng}) \not\preceq d(c, c_{ok})$.

For example, let us consider an assignment $1000 \in f$ and $0010 \in CB^+$. Note
that $d(0010, 1000) = 1010$. Then,

- for $0011 \in CB^-$, $d(0011, 1000) = 1011$ means $d(0011, 1000) \not\preceq d(0010, 1000)$.

- for $1110 \in CB^-$, $d(1110, 1000) = 0110$ means $d(1110, 1000) \not\preceq d(0010, 1000)$.

- for $1111 \in CB^-$, $d(1111, 1000) = 0111$ means $d(1111, 1000) \not\preceq d(0010, 1000)$.

Therefore, $1000 \in f_{CB}$.

On the other hand, let us consider an assignment $1011 \notin f$. For $0010 \in CB^+$,
there exists $0011 \in CB^-$ with $d(0011, 1011) \preceq d(0010, 1011)$, and for $0101 \in CB^+$,
there exists $1111 \in CB^-$ with $d(1111, 1011) \preceq d(0101, 1011)$. Therefore, $1011 \notin f_{CB}$.
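
The check described in Example 1 can be run mechanically. The following is a minimal sketch, not part of the original paper, that encodes cases as bit strings and applies Definition 2 directly; the function names `d`, `leq` and `classify` are chosen here for illustration.

```python
from itertools import product

def d(c1, c2):
    """Difference vector of two cases: bitwise XOR, returned as a tuple of 0/1."""
    return tuple(int(a != b) for a, b in zip(c1, c2))

def leq(x, y):
    """Set-inclusion order: x[i] <= y[i] for every component i."""
    return all(a <= b for a, b in zip(x, y))

def classify(c, cb_pos, cb_neg):
    """Definition 2: c is positive iff some positive case c_ok exists such that
    no negative case c_ng satisfies d(c_ng, c) <= d(c_ok, c)."""
    return any(
        all(not leq(d(ng, c), d(ok, c)) for ng in cb_neg)
        for ok in cb_pos
    )

cb_pos = ["0010", "0101"]
cb_neg = ["0011", "1110", "1111"]
cases = ["".join(bits) for bits in product("01", repeat=4)]
f_cb = {c for c in cases if classify(c, cb_pos, cb_neg)}
print(sorted(f_cb))
```

Enumerating all 16 assignments yields exactly the twelve satisfying assignments of $f$ listed in Example 1, and 1011 is rejected.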

A combination of the above $d$ and $\preceq$ leads to the following important
property, which will be used for the correspondence between case-based classification and
monotone theory.

Lemma 3. Let $c, c_1, c_2$ be cases. $d(c_1, c) \preceq d(c_2, c)$ if and only if
$d(c_1, c_2) \preceq d(c, c_2)$.

Proof: Clearly, $x \preceq y$ if and only if $x = x \,\&\, y$ where $=$ is a bitwise equality
and $\&$ is a bitwise AND operation. Therefore, $d(c_1, c) \preceq d(c_2, c)$ if and only if
$(c_1 \oplus c) = (c_1 \oplus c) \,\&\, (c_2 \oplus c)$.

Then, this is equivalent to $c_1 = (c_1 \& c_2) \oplus (c \& c_2) \oplus (c_1 \& c)$ by using that
$c \& c = c$, and $c_1 \oplus c = X \oplus c$ if and only if $c_1 = X$.

Similarly, $d(c_1, c_2) \preceq d(c, c_2)$ if and only if $c_1 = (c_1 \& c_2) \oplus (c \& c_2) \oplus (c_1 \& c)$. □

Note that if a difference function is defined as a numerical-valued function, we
do not have the above symmetrical property.

Lemma 4. Let $CB$ be a casebase $(CB^+, CB^-)$. Let $f_{CB}$ be a boolean function
represented by $CB$. Then,

$f_{CB} = \{c \mid \exists c_{ok} \in CB^+ \text{ s.t. } \forall c_{ng} \in CB^-,\ d(c_{ng}, c_{ok}) \not\preceq d(c, c_{ok})\}$

Proof: By Lemma 3. □

Note that the above lemma is important if we use a casebase for a classification
task. When we decide to fix a casebase, given a new case $c$ to be classified,
we need to compute both $d(c, c_{ng})$ and $d(c, c_{ok})$ in the original definition
of representation of a boolean function. However, by the above lemma, we can
“precompile” a fixed casebase for efficient classification, that is, we can compute
$d(c_{ng}, c_{ok})$ in advance. Then, we only need to compute $d(c, c_{ok})$ for a classification
task.
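
As an illustration of this precompilation idea, here is a small sketch (an assumption about how one might organize it, not the author's implementation). The hypothetical function `precompile` stores $d(c_{ng}, c_{ok})$ once per positive case, and `classify_precompiled` then needs only $d(c, c_{ok})$ per query, following Lemma 4.

```python
from itertools import product

def d(c1, c2):
    """Difference vector of two cases: bitwise XOR as a tuple of 0/1."""
    return tuple(int(a != b) for a, b in zip(c1, c2))

def leq(x, y):
    """Set-inclusion order on difference vectors."""
    return all(a <= b for a, b in zip(x, y))

def precompile(cb_pos, cb_neg):
    """For each positive case, compute d(c_ng, c_ok) once for every negative case."""
    return {ok: [d(ng, ok) for ng in cb_neg] for ok in cb_pos}

def classify_precompiled(c, table):
    """Lemma 4: only d(c, c_ok) is computed at query time."""
    for ok, ng_diffs in table.items():
        dc = d(c, ok)
        if all(not leq(diff, dc) for diff in ng_diffs):
            return True
    return False

def classify_direct(c, cb_pos, cb_neg):
    """Definition 2, used here only as a cross-check."""
    return any(all(not leq(d(ng, c), d(ok, c)) for ng in cb_neg) for ok in cb_pos)

cb_pos = ["0010", "0101"]
cb_neg = ["0011", "1110", "1111"]
table = precompile(cb_pos, cb_neg)
cases = ["".join(bits) for bits in product("01", repeat=4)]
agree = all(classify_precompiled(c, table) == classify_direct(c, cb_pos, cb_neg)
            for c in cases)
print(agree)
```

On the casebase of Example 1 the two classifiers agree on every assignment, as Lemma 4 predicts.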

We can detect redundant negative cases by using the following lemma. 

Lemma 5. Let $CB$ be a casebase $(CB^+, CB^-)$. Let $f_{CB}$ be a boolean function
represented by $CB$. Let $C_{ng} \in CB^-$ and $CB' = (CB^+, CB'^-)$ where $CB'^- = CB^- -
\{C_{ng}\}$. If for all $c_{ok} \in CB^+$ there exists $c_{ng} \in CB'^-$ s.t. $d(c_{ng}, c_{ok}) \preceq d(C_{ng}, c_{ok})$,
then $f_{CB} = f_{CB'}$.

Proof: Clearly, $f_{CB} \subseteq f_{CB'}$. Suppose that $f_{CB} \neq f_{CB'}$. Then, there exists some
$C$ such that $C \notin f_{CB}$ and $C \in f_{CB'}$. This means:

- $\forall c_{ok} \in CB^+\ \exists c_{ng} \in CB^-$ s.t. $d(c_{ng}, C) \preceq d(c_{ok}, C)$.

- $\exists c_{ok} \in CB^+\ \forall c_{ng} \in CB'^-$ s.t. $d(c_{ng}, C) \not\preceq d(c_{ok}, C)$. Let $C_{ok}$ be such a $c_{ok}$.

Then, $d(C_{ng}, C) \preceq d(C_{ok}, C)$. By Lemma 3 this means $d(C_{ng}, C_{ok}) \preceq d(C, C_{ok})$.
However, since there exists $c_{ng} \in CB'^-$ with $d(c_{ng}, C_{ok}) \preceq d(C_{ng}, C_{ok})$ by the
condition on $C_{ng}$, there exists $c_{ng} \in CB'^-$ with $d(c_{ng}, C_{ok}) \preceq d(C, C_{ok})$. This implies
$d(c_{ng}, C) \preceq d(C_{ok}, C)$ again by Lemma 3 and leads to a contradiction. □

Example 2. Consider the same boolean function $f$, $CB^+$ and $CB^-$ as in Example 1.
Note that $\overline{f} = CB^- \cup \{1011\}$. For $0010 \in CB^+$, there exists $0011 \in CB^-$ such that
$d(0011, 0010) \preceq d(1011, 0010)$, and for $0101 \in CB^+$, there exists $1111 \in CB^-$ such
that $d(1111, 0101) \preceq d(1011, 0101)$. Then, $f_{CB} = f_{CB'}$ where $CB = (CB^+, \overline{f})$ and
$CB' = (CB^+, CB^-)$, as Lemma 5 states.

Definition 6. Let $S$ be a set of cases and $c$ be a case. We define the nearest
cases of $S$ from $c$, $NN(c, S)$, as follows.

$NN(c, S) = \{c' \in S \mid \neg\exists c'' \in S \text{ s.t. } d(c, c'') \prec d(c, c')\}$

Corollary 7. Let $CB$ be a casebase $(CB^+, CB^-)$.
Let $CB' = (CB^+, \bigcup_{c_{ok} \in CB^+} NN(c_{ok}, CB^-))$. Then, $f_{CB} = f_{CB'}$.

Proof: Suppose $c_{ng} \notin \bigcup_{c_{ok} \in CB^+} NN(c_{ok}, CB^-)$. Then, for every $c_{ok} \in CB^+$,
$c_{ng} \notin NN(c_{ok}, CB^-)$. This means that there exists $c'' \in CB^-$ s.t. $d(c_{ok}, c'') \prec
d(c_{ok}, c_{ng})$. Therefore, by Lemma 5, $f_{CB} = f_{CB''}$ where $CB'' = (CB^+, (CB^- -
\{c_{ng}\}))$. Even after removing $c_{ng}$ from $CB^-$, $\bigcup_{c_{ok} \in CB^+} NN(c_{ok}, (CB^- - \{c_{ng}\})) =
\bigcup_{c_{ok} \in CB^+} NN(c_{ok}, CB^-)$, since otherwise $c_{ng}$ was in $\bigcup_{c_{ok} \in CB^+} NN(c_{ok}, CB^-)$.
Therefore, we can remove all $c_{ng}$ such that $c_{ng} \notin \bigcup_{c_{ok} \in CB^+} NN(c_{ok}, CB^-)$ from
$CB^-$ without changing $f_{CB}$, and thus $f_{CB} = f_{CB'}$. □

This corollary also helps speed up classification together with Lemma 4,
since we do not need to compute $d(c_{ng}, c_{ok})$ for redundant negative cases.
However, note that when a positive case is added, we must check redundancy for
unused negative cases again.
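
The pruning in Corollary 7 can be sketched as follows. This is an illustrative sketch rather than the paper's procedure; `nearest` implements $NN(c, S)$ from Definition 6 and `prune_negatives` (a name introduced here) keeps only the negative cases that are nearest to some positive case.

```python
def d(c1, c2):
    """Difference vector of two cases: bitwise XOR as a tuple of 0/1."""
    return tuple(int(a != b) for a, b in zip(c1, c2))

def lt(x, y):
    """Strict set inclusion: x <= y componentwise and x != y."""
    return x != y and all(a <= b for a, b in zip(x, y))

def nearest(c, s):
    """NN(c, S): members of S whose difference from c is minimal under inclusion."""
    return {c1 for c1 in s if not any(lt(d(c, c2), d(c, c1)) for c2 in s)}

def prune_negatives(cb_pos, cb_neg):
    """Corollary 7: union of NN(c_ok, CB-) over all positive cases."""
    return set().union(*(nearest(ok, cb_neg) for ok in cb_pos))

cb_pos = ["0010", "0101"]
f_bar = {"0011", "1011", "1110", "1111"}  # complement of f from Example 1
print(sorted(prune_negatives(cb_pos, f_bar)))
```

Starting from all four non-satisfying assignments of Example 1, the pruning drops 1011 and returns the three negative cases used in the example, which matches Example 2.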

4 Relation between Case-Based Classification and 
Monotone Theory 

We follow the definition by [?].

Let $z$, $x$ and $b$ be assignments. We define $z \le_b x$ as $d(z, b) \preceq d(x, b)$. Let $f$ be
a boolean function. The $b$-monotone boolean function of $f$ is $\mathcal{M}_b(f) = \{x \mid \exists z \in
f,\ z \le_b x\}$.

Let $f$ be a boolean function and $S$ be a set of assignments. We write $\mathcal{M}_S(f) =
\bigcap_{b \in S} \mathcal{M}_b(f)$. We call $\mathcal{M}_S(f)$ the monotone extension of the boolean function $f$
w.r.t. $S$.

The following is the main theorem of this paper, expressing a relationship
between case-based reasoning and monotone theory.

Theorem 8. Let $CB$ be a casebase $(CB^+, CB^-)$. We regard $CB^-$ as a boolean
function such that for an assignment $x$, $CB^-(x) = 1$ if and only if $x \in CB^-$.
Then, $f_{CB} = \overline{\mathcal{M}_{CB^+}(CB^-)}$.
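
Proof:

$$\begin{aligned}
f_{CB} &= \{c \mid \exists c_{ok} \in CB^+ \text{ s.t. } \forall c_{ng} \in CB^-,\ d(c_{ng}, c_{ok}) \not\preceq d(c, c_{ok})\} \quad \text{(by Lemma 4)} \\
&= \overline{\{c \mid \forall c_{ok} \in CB^+\ \exists c_{ng} \in CB^-,\ d(c_{ng}, c_{ok}) \preceq d(c, c_{ok})\}} \\
&= \overline{\{c \mid \forall c_{ok} \in CB^+\ \exists c_{ng} \in CB^-,\ c_{ng} \le_{c_{ok}} c\}} \\
&= \overline{\bigcap_{c_{ok} \in CB^+} \{c \mid \exists c_{ng} \in CB^-,\ c_{ng} \le_{c_{ok}} c\}} \\
&= \overline{\bigcap_{c_{ok} \in CB^+} \mathcal{M}_{c_{ok}}(CB^-)} \\
&= \overline{\mathcal{M}_{CB^+}(CB^-)} \qquad \Box
\end{aligned}$$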

Therefore, a boolean function represented by case-based classification using set 
inclusion similarity is the complement of monotone extension of negative cases 
w.r.t. positive cases. 





Fig. 1. $\mathcal{M}_{0010}(\{1111, 1110, 0011\})$ and $\mathcal{M}_{0101}(\{1111, 1110, 0011\})$



Example 3. Let $CB^+ = \{0010, 0101\}$ and $CB^- = \{0011, 1110, 1111\}$ as in
Example 1. Each lattice in Figure 1 is induced by the difference between the assignment
in a node and the assignment in the bottom node. Underlined bits of an
assignment express a difference from the assignment of the bottom node, crossed
nodes of the lattice on the left express $\mathcal{M}_{0010}(CB^-)$, and those on the right
express $\mathcal{M}_{0101}(CB^-)$. Note that $\mathcal{M}_{CB^+}(CB^-) = \mathcal{M}_{0010}(CB^-) \cap \mathcal{M}_{0101}(CB^-) =
\{0011, 1011, 1110, 1111\}$, which coincides with $\overline{f}$ in Example 1.
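
Theorem 8 and Example 3 can be checked by brute force over all 16 assignments. The sketch below is illustrative only; `monotone_b` and `monotone_extension` are names chosen here for $\mathcal{M}_b(f)$ and $\mathcal{M}_S(f)$, with sets of bit strings standing in for boolean functions.

```python
from itertools import product

def d(c1, c2):
    """Difference vector of two assignments: bitwise XOR as a tuple of 0/1."""
    return tuple(int(a != b) for a, b in zip(c1, c2))

def leq(x, y):
    """Set-inclusion order on difference vectors."""
    return all(a <= b for a, b in zip(x, y))

def monotone_b(f, b, cases):
    """M_b(f): all x for which some z in f satisfies d(z, b) <= d(x, b)."""
    return {x for x in cases if any(leq(d(z, b), d(x, b)) for z in f)}

def monotone_extension(f, s, cases):
    """M_S(f): intersection of M_b(f) over all b in S."""
    result = set(cases)
    for b in s:
        result &= monotone_b(f, b, cases)
    return result

def classify(c, cb_pos, cb_neg):
    """Definition 2, used to compute f_CB for comparison."""
    return any(all(not leq(d(ng, c), d(ok, c)) for ng in cb_neg) for ok in cb_pos)

cases = ["".join(bits) for bits in product("01", repeat=4)]
cb_pos = ["0010", "0101"]
cb_neg = ["0011", "1110", "1111"]
extension = monotone_extension(cb_neg, cb_pos, cases)
f_cb = {c for c in cases if classify(c, cb_pos, cb_neg)}
print(sorted(extension), f_cb == set(cases) - extension)
```

The extension comes out as the four assignments named in Example 3, and $f_{CB}$ is exactly its complement.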

By using the above correspondence, we can apply findings from monotone
theory to the analysis of case-based classification. In the sequel, we paraphrase
results and proofs in the study of monotone theory in terms of case-based learning.

The following is a dual of Claim 4.7 of [?], which is related to the size of $CB^+$.

Lemma 9. Let $CB^+$ be a set of cases and $D_1 \vee \ldots \vee D_k$ be a DNF representation
of a boolean function $f$. Suppose that for every $D_i$, there exists $c_{ok} \in CB^+$ such
that $c_{ok} \in \phi(D_i)$. Then, $f = f_{CB}$ where $CB = (CB^+, \overline{f})$.

Proof:

Since $f_{CB} = \overline{\mathcal{M}_{CB^+}(\overline{f})}$ by Theorem 8, we show $\overline{f} = \mathcal{M}_{CB^+}(\overline{f})$. Since it is always
the case that $\overline{f} \subseteq \mathcal{M}_{CB^+}(\overline{f})$, that is, $\overline{\mathcal{M}_{CB^+}(\overline{f})} \subseteq f$, it is sufficient to show
that for every $c$ satisfying $f$, there is some positive case $c_{ok} \in CB^+$ such that
$c \notin \mathcal{M}_{c_{ok}}(\overline{f})$.

Suppose $c$ satisfies $f$. Then, there exists a conjunct $D$ of the above DNF
formula which does not contain a positive literal and a negative literal of the
same variable simultaneously such that $c \in \phi(D)$. This means that for every $i$ ($1 \le i \le n$),

- $c[i] = 1$ if $X_i$ appears in $D$.

- $c[i] = 0$ if $\neg X_i$ appears in $D$.

Let $c_{ok} \in CB^+$ be a case satisfying $c_{ok} \in \phi(D)$. This also means that for
every $i$ ($1 \le i \le n$),

- $c_{ok}[i] = 1$ if $X_i$ appears in $D$.

- $c_{ok}[i] = 0$ if $\neg X_i$ appears in $D$.

Now, we show that $c \notin \mathcal{M}_{c_{ok}}(\overline{f})$. Suppose that $c \in \mathcal{M}_{c_{ok}}(\overline{f})$. Then, there exists a
case $z$ satisfying $\overline{f}$ such that $d(z, c_{ok}) \preceq d(c, c_{ok})$. This means that if $c[i] = c_{ok}[i]$
then $z[i] = c_{ok}[i]$. Therefore, for every $i$ ($1 \le i \le n$),

- $z[i] = 1$ if $X_i$ appears in $D$.

- $z[i] = 0$ if $\neg X_i$ appears in $D$.

This means $z \in f$ and this leads to a contradiction. Thus, $f \subseteq \overline{\mathcal{M}_{CB^+}(\overline{f})}$. □

Example 4. Consider the same boolean function $f$ as in Example 1. A DNF
representation of $f$ is

$(\neg X_1 \wedge X_2) \vee (\neg X_1 \wedge \neg X_4) \vee (\neg X_2 \wedge \neg X_4) \vee \neg X_3$.

Then, for every conjunct $D$ in the above representation there exists $c_{ok} \in CB^+$
in Example 1 such that $c_{ok} \in \phi(D)$. Let $CB$ be a casebase $(CB^+, \overline{f})$. We can
easily check that $f = f_{CB}$ as Lemma 9 states.

For the next lemma, we use a set of cases $PNN(b, f)$ defined as follows:

$\{c \mid c \notin f$ s.t. there is no $c' \notin f$ s.t. $d(c', b) \prec d(c, b)$ and
there exists $l$ s.t. $c[l] \neq c'[l]$ and $c[j] = c'[j]$ for $j \neq l$ ($1 \le j \le n$)$\}$

$PNN(b, f)$ is a set of pseudo nearest neighbor assignments from $b$ with respect
to $f$. Note that $NN(b, \overline{f}) \subseteq PNN(b, f)$.

The following is a dual of Claim 4.8 of [?], which is related to the size of $CB^-$.

Lemma 10. Suppose that $D_1 \wedge \ldots \wedge D_k$ is a CNF representation for a boolean
function $f$ and $b$ is a case. Then, $|PNN(b, f)| \le k$.

Proof:

Let $D$ be any clause in the above CNF representation. If $D$ contains the
positive literal and the negative literal of the same variable, then it becomes 1
and can be ignored. Otherwise, we define a case $c_{min}(D)$ w.r.t. a clause $D$ in the
above CNF representation of $f$ as follows. For every $j$ ($1 \le j \le n$),

- $c_{min}(D)[j] = 1$ if $X_j$ appears negatively in $D$.

- $c_{min}(D)[j] = 0$ if $X_j$ appears positively in $D$.

- $c_{min}(D)[j] = b[j]$ if $X_j$ does not appear in $D$.

Consider such $c$ that for some $l$, $c_{min}(D)[l] \neq c[l]$, and $c_{min}(D)[j] = c[j]$ for
$j \neq l$ ($1 \le j \le n$).

- If $X_l$ appears negatively in $D$, then $c[l] = 0$ and so, $c \in f$.

- If $X_l$ appears positively in $D$, then $c[l] = 1$ and so, $c \in f$.

- If $X_l$ does not appear in $D$, then $c[l] \neq b[l]$ and so, $d(c_{min}(D), b) \prec d(c, b)$.

Therefore, $c_{min}(D) \in PNN(b, f)$.

Suppose $c' \notin f$. Then, there exists some clause $D'$ in the above CNF
representation of $f$ such that for every $j$ ($1 \le j \le n$),

- $c'[j] = 1$ if $X_j$ appears negatively in $D'$.

- $c'[j] = 0$ if $X_j$ appears positively in $D'$.

Suppose $c' \neq c_{min}(D')$. Then, $c'$ is different from $c_{min}(D')$ only in such $j$-th
values that $X_j$ does not appear in $D'$. This means $d(c_{min}(D'), b) \prec d(c', b)$.
Therefore, $PNN(b, f) = \{c_{min}(D) \mid D$ is a clause in the above CNF
representation of $f\}$ and thus $|PNN(b, f)| \le k$. (Note that $c_{min}(D)$ might be equal to
$c_{min}(D')$ for another $D'$.) □

Corollary 11. Let $f$ be a boolean function which has a CNF representation
$D_1 \wedge \ldots \wedge D_k$. For every case $b$, $|NN(b, \overline{f})| \le k$. In particular,
$|NN(b, \overline{f})| \le |CNF(f)|$.

Proof: $NN(b, \overline{f}) \subseteq PNN(b, f)$, and the bound follows by Lemma 10. □

5 Representability 

We modify the definition of polynomial representability in [?] so that we confine
the similarity measure to the one based on set inclusion.

Let $S$ be a set of cases or assignments. Then, we denote the number of cases
or assignments in $S$ as $|S|$. Let $CB$ be a casebase $(CB^+, CB^-)$. We denote the
sum $|CB^+| + |CB^-|$ as $|CB|$, which we call the size of the casebase $CB$.

Definition 12. Let $\mathcal{F}_n$ be a class of boolean functions from $\{0,1\}^n$ to $\{0,1\}$. Let
$\mathcal{F} = \bigcup_{n \ge 1} \mathcal{F}_n$. $\mathcal{F}$ is called polynomially representable iff there exists a polynomial
$p(\cdot)$ such that, for every $n$ and every $f \in \mathcal{F}_n$, there exists a casebase $CB_n$ that
satisfies $f_{CB_n} = f$ and $|CB_n| \le p(n)$.

The following theorem gives the upper bound of representability of boolean 
functions. 

Theorem 13. Let $f$ be a boolean function. Then, there exists a casebase $CB =
(CB^+, CB^-)$ such that $f_{CB} = f$, $|CB^+| \le |DNF(f)|$, $|CB^-| \le |DNF(f)| \cdot |CNF(f)|$ and
$|CB| \le |DNF(f)|(1 + |CNF(f)|)$.

Proof: There is a DNF representation $D_1 \vee \ldots \vee D_k$ of $f$ such that $k = |DNF(f)|$.
If we take as $CB^+$ a set of cases containing, for each $D_i$, a case that satisfies $D_i$, then by Lemma 9
$f = f_{CB'}$ where $CB' = (CB^+, \overline{f})$ and $|CB^+| \le |DNF(f)|$. Then, by
Corollary 7, $f = f_{CB}$ where $CB = (CB^+, CB^-)$ and $CB^- = \bigcup_{c_{ok} \in CB^+} NN(c_{ok}, \overline{f})$.
Then, since by Corollary 11, for each $c_{ok} \in CB^+$, $|NN(c_{ok}, \overline{f})| \le |CNF(f)|$,

$$|CB^-| \le \sum_{c_{ok} \in CB^+} |CNF(f)| \le |DNF(f)| \cdot |CNF(f)|.$$

□
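
The construction in this proof can be traced on the DNF of Example 4. The sketch below is an illustration under stated assumptions, not the author's algorithm: DNF terms are encoded as dictionaries from a 0-based variable index to the required bit, the satisfying case chosen for each term sets every unconstrained variable to 0 (an arbitrary choice), and $\overline{f}$ is obtained by enumerating all assignments, which is only feasible for small $n$. The names `casebase_from_dnf` and `satisfies` are hypothetical.

```python
from itertools import product

def d(c1, c2):
    """Difference vector of two cases: bitwise XOR as a tuple of 0/1."""
    return tuple(int(a != b) for a, b in zip(c1, c2))

def leq(x, y):
    """Set-inclusion order on difference vectors."""
    return all(a <= b for a, b in zip(x, y))

def nearest(c, s):
    """NN(c, S) from Definition 6."""
    return {c1 for c1 in s
            if not any(leq(d(c, c2), d(c, c1)) and d(c, c2) != d(c, c1) for c2 in s)}

def satisfies(c, term):
    """A term maps a 0-based variable index to the bit it requires."""
    return all(c[i] == str(v) for i, v in term.items())

def classify(c, cb_pos, cb_neg):
    """Definition 2."""
    return any(all(not leq(d(ng, c), d(ok, c)) for ng in cb_neg) for ok in cb_pos)

def casebase_from_dnf(terms, n):
    cases = ["".join(bits) for bits in product("01", repeat=n)]
    f = {c for c in cases if any(satisfies(c, t) for t in terms)}
    f_bar = set(cases) - f
    # One satisfying case per term; unconstrained variables are set to 0 (arbitrary).
    cb_pos = {"".join(str(t.get(i, 0)) for i in range(n)) for t in terms}
    # Corollary 7: keep only negative cases nearest to some positive case.
    cb_neg = set().union(*(nearest(ok, f_bar) for ok in cb_pos))
    return cases, f, cb_pos, cb_neg

# DNF of Example 4: (~X1 & X2) | (~X1 & ~X4) | (~X2 & ~X4) | ~X3
terms = [{0: 0, 1: 1}, {0: 0, 3: 0}, {1: 0, 3: 0}, {2: 0}]
cases, f, cb_pos, cb_neg = casebase_from_dnf(terms, 4)
f_cb = {c for c in cases if classify(c, cb_pos, cb_neg)}
print(sorted(cb_pos), sorted(cb_neg), f_cb == f)
```

With this particular choice of positive cases the casebase has two positive and two negative cases, well within the bound $|DNF(f)|(1 + |CNF(f)|)$, and it differs from the casebase of Example 1, which shows that the representation is not unique.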

Corollary 14. Let $\mathcal{F}_n$ be a class of boolean functions from $\{0,1\}^n$ to $\{0,1\}$. Let
$\mathcal{F} = \bigcup_{n \ge 1} \mathcal{F}_n$. If for every $f \in \mathcal{F}_n$, $|DNF(f)|$ and $|CNF(f)|$ are polynomial in
$n$, then $\mathcal{F}$ is polynomially representable.

There are interesting classes of boolean function which have the above prop- 
erty. 

Theorem 15. The class of $k$-term DNF is the class of boolean functions whose
DNF contains at most $k$ conjunctions. $k$-term DNF is polynomially representable.

Proof: For every $n$-ary boolean function $f$ in the $k$-term DNF class, $|DNF(f)| \le k$
and $|CNF(f)| \le n^k$. Therefore, by Theorem 13 there exists a casebase $CB$ such
that $f = f_{CB}$ and $|CB| \le n^k(k + 1)$. □

In a similar way, the class of $k$-clause CNF (the class of boolean functions whose
CNF contains at most $k$ clauses) is polynomially representable.

So far, we have been concerned with representing a boolean function as a
casebase. Suppose that a casebase represents a boolean function correctly. If we
can decode the boolean function in a rule form, it might be useful for a user to
get a hint of the structure of the boolean function. For example, if we apply our
case-based learning to the domain of legal reasoning, then extracting a rule from the
casebase corresponds to making a case law which tells a lawyer when a previous
case can be applied to the current case. This kind of decoding can be done as
follows.

Definition 16. Let $CB$ be a casebase $(CB^+, CB^-)$. For every $c_{ok} \in CB^+$, we
construct a CNF formula as follows and combine them with $\vee$ to result in a formula
denoted as $F_{CB}$.

For every $c_{ng} \in CB^-$, we construct a clause in the CNF as follows.

- If $c_{ng}[i] = 0$ and $c_{ok}[i] = 1$, then $X_i$ is in the clause.

- If $c_{ng}[i] = 1$ and $c_{ok}[i] = 0$, then $\neg X_i$ is in the clause.

Theorem 17. Let $CB$ be a casebase and $F_{CB}$ be constructed in the way of
Definition 16. Then, $\phi(F_{CB}) = f_{CB}$.

Proof: Let $CB = (CB^+, CB^-)$.

(1) $\phi(F_{CB}) \subseteq f_{CB}$

Suppose $c \in \phi(F_{CB})$. Then, there is a CNF formula $D$ in $F_{CB}$ defined in
Definition 16 such that $c \in \phi(D)$.

Let $c_{ok}$ be the case corresponding to $D$. This means that for every clause $E$
in $D$, $c \in \phi(E)$ and so, there is $i$ ($1 \le i \le n$) s.t. either $c[i] = 1$ and $X_i \in E$
or $c[i] = 0$ and $\neg X_i \in E$. Therefore, for every $c_{ng} \in CB^-$, there is $i$ s.t. either
$c[i] = 1$ and $c_{ng}[i] = 0$ and $c_{ok}[i] = 1$, or $c[i] = 0$ and $c_{ng}[i] = 1$ and $c_{ok}[i] = 0$.
This means that for every $c_{ng} \in CB^-$, $d(c, c_{ng}) \not\preceq d(c, c_{ok})$. Therefore, $c \in f_{CB}$.

(2) $f_{CB} \subseteq \phi(F_{CB})$

Suppose $c \in f_{CB}$. Then, there is $c_{ok} \in CB^+$ such that for every $c_{ng} \in CB^-$,
$d(c, c_{ng}) \not\preceq d(c, c_{ok})$.

This means that for every $c_{ng} \in CB^-$, there is $i$ s.t. $c[i] = c_{ok}[i]$ and $c[i] \neq
c_{ng}[i]$. This means that for such $i$, $c[i] = 1$ and $c_{ok}[i] = 1$ and $c_{ng}[i] = 0$, or
$c[i] = 0$ and $c_{ok}[i] = 0$ and $c_{ng}[i] = 1$.

Let $D$ be the CNF formula in $F_{CB}$ corresponding to $c_{ok}$ as defined in
Definition 16. Let $E$ be a clause in $D$ which corresponds to $c_{ng} \in CB^-$. By
the construction of $E$, we can show that $c \in \phi(E)$. Therefore, $c \in \phi(F_{CB})$. □

Example 5. Consider Example 1 again. Note that $CB^+ = \{0010, 0101\}$ and
$CB^- = \{0011, 1110, 1111\}$. For $0010 \in CB^+$, we produce the following CNF
formula:

$\neg X_4 \wedge (\neg X_1 \vee \neg X_2) \wedge (\neg X_1 \vee \neg X_2 \vee \neg X_4)$.

And for $0101 \in CB^+$, we produce the following CNF formula:

$(X_2 \vee \neg X_3) \wedge (\neg X_1 \vee \neg X_3 \vee X_4) \wedge (\neg X_1 \vee \neg X_3)$.

By combining these two CNF formulas by $\vee$, we have a formula which is
logically equivalent to the formula in Example 1.
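
The decoding of Definition 16 is easy to mechanize. The following sketch is illustrative and uses a representation chosen here, not one from the paper: $F_{CB}$ is a list of CNFs (one per positive case), each CNF is a list of clauses, and each literal is a pair `(i, polarity)` with a 0-based variable index, where `polarity` is `True` for $X_{i+1}$ and `False` for $\neg X_{i+1}$.

```python
from itertools import product

def decode(cb_pos, cb_neg):
    """Build F_CB following Definition 16."""
    formula = []
    for ok in cb_pos:
        cnf = []
        for ng in cb_neg:
            # Wherever ok and ng differ, the literal takes the polarity of ok:
            # X_i if ok[i] = 1 (and ng[i] = 0), not X_i if ok[i] = 0 (and ng[i] = 1).
            clause = [(i, ok[i] == "1") for i in range(len(ok)) if ok[i] != ng[i]]
            cnf.append(clause)
        formula.append(cnf)
    return formula

def evaluate(formula, c):
    """Truth value of a disjunction of CNFs under the assignment c (a bit string)."""
    return any(
        all(any((c[i] == "1") == polarity for i, polarity in clause) for clause in cnf)
        for cnf in formula
    )

cb_pos = ["0010", "0101"]
cb_neg = ["0011", "1110", "1111"]
formula = decode(cb_pos, cb_neg)
cases = ["".join(bits) for bits in product("01", repeat=4)]
models = {c for c in cases if evaluate(formula, c)}
print(formula[0])
print(len(models))
```

The first CNF printed corresponds to $\neg X_4 \wedge (\neg X_1 \vee \neg X_2) \wedge (\neg X_1 \vee \neg X_2 \vee \neg X_4)$ from Example 5, and the set of models is the set of twelve satisfying assignments of $f$ from Example 1.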

6 Conclusion 

We believe that the following are contributions of this paper.

- We propose a new similarity measure for case-based classification based on
set inclusion and show properties which can be used for reduction of
redundant cases.

- We show a correspondence between a boolean function represented by a
casebase with the proposed similarity measure and a boolean function defined
by monotone theory.

- By using the correspondence, we show that for every function $f$, we can
represent $f$ in a casebase whose size is bounded by $|DNF(f)|(1 + |CNF(f)|)$.

- We show how to construct a formula which is equivalent to a boolean function
represented by a casebase.

As future research, we would like to do the following.

- In [?], boolean functions are learned by membership queries and equivalence
queries. Equivalence queries can be replaced by a sampling oracle in the PAC
learning framework. We will study this method for active construction of a casebase.

- We would like to extend our method to represent other kinds of concepts, such
as concepts with numerical attributes.

- We would like to apply our method to real applications for evaluation.

Abstract. In algorithmic learning theory fundamental roles are played
by the family of languages that are locally testable in the strict sense 
and by the family of reversible languages. These two families are shown 
to be the first two members of an infinite sequence of families of regular 
languages the members of which are learnable in the limit from positive 
data only. A uniform procedure is given for deciding, for each regular 
language R and each of our specified families, whether R belongs to the 
family. The approximation of arbitrary regular languages by languages 
belonging to these families is discussed. Further, we will give a uniform 
scheme for learning these families from positive data. Several research 
problems are also suggested. 



Keywords: reversible languages, local languages, regular languages, identifica- 
tion in the limit from positive data, approximate learning 



1 Introduction 

In several natural ways one can provide, with a single definition, the global
specification of exactly one binary irreflexive relation in the set of states of every
automaton. We will use each such globally specified class of binary relations to 
define a family of regular languages. Two trivial examples will provide clarifica- 
tion of what we mean by such a global specification: the empty relation, and the 
relation of inequality. We will see that these two extremes provide the upper and 
lower range, respectively, of the families of languages definable by the means we 
propose. The empty relation provides the family of all regular languages and the 
relation of inequality provides the family of languages that are locally testable 
in the strict sense. The family of reversible languages is an intermediate example 
of a family definable in the suggested manner. 

We use this general technique to provide a sequence of language families
{DB_n | n is a non-negative integer} for which DB_0 coincides with the family
of languages that are locally testable in the strict sense and DB_1 is the
family of reversible languages. A foundational fact given by D. Angluin (PI)
is that each reversible language contains a constructible finite subset that
determines the language from among the members of its family at its level.
She called each such subset a characteristic sample of the language. We
observe that characteristic samples exist not only for the reversible
languages, but for all members of our sequence DB_n. The significance of
Angluin's characteristic samples is that they provide the basis for an
algorithm for learning the languages that contain them in the limit from
positive data only. Thus her results tell us that her learning procedure can
be generalized to work for all of the members of our sequence DB_n (n ≥ 0).

We provide an elementary algorithm that decides in an entirely uniform 
manner whether a given regular language is a member of a given family of 
any of the types defined here. For each regular language R and for each of the
language families DB_n, we construct a minimal member (at the k-th level) of
the family that contains R. This construction is also done in an elementary and 
uniform manner for all families. For an understanding of the significance of this 
construction for the theory of approximate learning, see |2|. 

Results reported here are related to research on biomolecular phenomena 
through the applications of learning theory to the semantics of DNA (H3| and 
HSl) and through the methodological similarity with work on splicing systems 
(0, 0 and Pj) which model DNA recombinant behaviors. 

2 Preliminaries 

Let A be a finite alphabet. By A*, we denote the set of all strings over A. By 1,
we denote the null string. A subset of A* is called a language over A. A set of
languages is called a family of languages. For a language L and a string w, by
w\L, we denote the set {x ∈ A* | wx ∈ L}. We write w_1 π_L w_2 if w_1\L = w_2\L.

For an equivalence relation π over A* and a string w, by [w]_π, we denote the
equivalence class of π containing w. For equivalence relations π_1 and π_2 over
A*, π_1 is said to be finer than π_2 if for any s, t ∈ A*, s π_1 t implies
s π_2 t. An equivalence relation π over A* is called a right congruence if for
any s, t ∈ A* and a ∈ A, s π t implies sa π ta. Note that for any language L,
π_L is a right congruence.

By an automaton over A, we mean a formal system M = (A, S, I, F, E),
where A is a finite set that serves as the input alphabet, S is a finite set of
states, I is the set of initial states, F is the set of final states, and E is the set
of directed edges labeled by elements of A. Thus I and F are subsets of S and
we may regard E as a subset of S × A × S. In case that the cardinality of I is 1
and for every p ∈ S and a ∈ A, (p, a, q_1), (p, a, q_2) ∈ E implies q_1 = q_2, M is said
to be deterministic. We allow the set of initial states to contain more than one
element of S, but minimal automata will be assumed to have only a single initial
state. For a state p and a string w, pw will denote the set of all states which are
accessible from the state p with the transition w. In case that pw is a singleton
{q}, we simply write pw = q. L(M) will denote the language recognized by M,
i.e., L(M) = {w ∈ A* | ∃p_0 ∈ I such that p_0 w contains some element in F}.
An automaton is said to be trimmed if every state is accessible from an initial
state and, for every state q, there is a final state which is accessible from q. A
language is regular if it is recognized by a finite automaton.

For a language L over A, by Pre(L), we denote the set of all prefixes of
elements of L. Let π be a right congruence which is finer than π_L. Then, it is
straightforward to see that for a given language L, we can construct a trimmed
automaton M_{π,L} = (A, S, I, F, E) such that L(M_{π,L}) = L using π as follows:

    S = {[w]_π | w ∈ Pre(L)},
    I = {[1]_π},
    F = {[w]_π | w ∈ L},
    E = {([w_1]_π, a, [w_1 a]_π) | w_1 a ∈ Pre(L)}.

It is well known that M_{π_L,L} is a trimmed minimal automaton of L. This
construction of an automaton for a given language L will be used in Section 6.
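
The construction can be tried out on a finite language. The following is a minimal sketch, not part of the original development: it assumes π = π_L, represents each class [w] by the left quotient w\L (two strings are π_L-equivalent exactly when their quotients coincide), and uses the hypothetical function name quotient_automaton.

```python
def quotient_automaton(language, alphabet):
    """Build (S, I, F, E) for a finite language, with states [w] = w\\L."""
    def quotient(w):
        # Left quotient w\L, used here as the name of the class [w].
        return frozenset(x[len(w):] for x in language if x.startswith(w))

    prefixes = {x[:i] for x in language for i in range(len(x) + 1)}
    states = {quotient(w) for w in prefixes}
    initial = {quotient("")} if language else set()
    finals = {quotient(w) for w in language}
    # One edge [w] -a-> [wa] for every prefix wa of the language.
    edges = {(quotient(w), a, quotient(w + a))
             for w in prefixes for a in alphabet if w + a in prefixes}
    return states, initial, finals, edges
```

For L = {ab, bab} over {a, b} this yields four states and four edges, since the prefixes a and ba have the same quotient {b} and are therefore identified.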

Let M_1 = (A, S_1, I_1, F_1, E_1) and M_2 = (A, S_2, I_2, F_2, E_2) be trimmed
automata. We write M_1 ≤ M_2 if and only if there exists a function θ from S_1 to
S_2 such that (1) θ(I_1) ⊆ I_2, (2) θ(F_1) ⊆ F_2, and (3) for any p ∈ S_1 and
a ∈ A, θ(pa) ⊆ θ(p)a holds.

Proposition 1. M_1 ≤ M_2 implies L(M_1) ⊆ L(M_2). □

The following argument gives a method for constructing a new finer right
congruence from a given right congruence. Let π be any right congruence over
A*. A string w_1 = a_1 ⋯ a_n (a_i ∈ A) is said to be π-reduced to a string w_2 at
positions (i, j) if a_1 ⋯ a_i π a_1 ⋯ a_j, 1 ≤ i < j ≤ n, and
w_2 = a_1 ⋯ a_i a_{j+1} ⋯ a_n. Let w_2 be a π-reduction of w_1 at positions (i, j).
Then, w_2 is called a left-most π-reduction of w_1 if there exists no π-reduction
of w_1 at positions (i', j') such that j' < j. Such a left-most π-reduction is
determined uniquely. We write w_1 → w_2 if w_2 is a left-most π-reduction of w_1.
Let →* be the reflexive and transitive closure of →. Then, we write x π̂ y if
there exists some string w such that x →* w and y →* w. Note that π̂ is a right
congruence which is finer than π. A string x is said to be minimal with respect
to π if there exists no string y such that x →* y and x ≠ y.

An interesting right congruence, which plays a significant role in this article,
is defined below. For a string w, by suf_k(w), we denote the k-length suffix
of w. In case that the length of w is less than k, suf_k(w) is defined as w itself. For
a non-negative integer k and a language L, we write w_1 π_{L,k} w_2 if w_1\L = w_2\L
and suf_k(w_1) = suf_k(w_2). In case of k = 0, π_{L,0} coincides with π_L. Note that
π_{L,k} is a right congruence for each k and L, and π_{L,k} is finer than π_L. For a given
language L and a non-negative integer k, the right congruence π_{L,k} will be used
in Section 7 to show the existence of a characteristic sample of L with respect to
the family of DB_n languages at its k-th level.
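
For a finite language the relation π_{L,k} can be tested directly from its definition. The sketch below is illustrative only; the function names left_quotient, suf and related are our own and do not come from the original text.

```python
def left_quotient(w, language):
    """Return w\\L = {x | wx in L} for a finite language L."""
    return frozenset(x[len(w):] for x in language if x.startswith(w))


def suf(k, w):
    """Return the k-length suffix of w, or w itself if w is shorter than k."""
    return w if len(w) < k else w[len(w) - k:]


def related(w1, w2, language, k):
    """Test w1 pi_{L,k} w2: equal left quotients and equal k-length suffixes."""
    return (left_quotient(w1, language) == left_quotient(w2, language)
            and suf(k, w1) == suf(k, w2))
```

With L = {ab, bab}, the strings a and ba have the same quotient {b}, so they are related for k = 0 and k = 1, but not for k = 2, where their suffixes a and ba differ. This illustrates how larger k gives a finer congruence.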

3 The Definitions Chosen for the LTS Languages and
the Reversible Languages

The concept of a language that is locally testable in the strict sense (LTS) was
first defined by R. McNaughton and S. Papert (El). Most authors now use
modifications of their definition. Unfortunately, a language such as ab*a,
which is in LTS by the original definition, is not LTS according to some current
definitions. This situation is confusing. For simplicity we will use as our definition
a property that is easily confirmed to hold for all of the definitions of the family
LTS that are currently in use. For this purpose Schützenberger's concept of a
constant (H2|) will be used: A string w in A* is a constant relative to a language
L if, whenever uwv and swt are in L, both uwt and swv are also in L. It is easy
to see that if w is a constant relative to L, then for any strings x, y ∈ A*, xwy
is also a constant relative to L. The following proposition is easily verified:

Proposition 2. A string w is a constant relative to a language L if and only
if for any uw, sw ∈ Pre(L), uw \ L = sw \ L holds. □
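
Both the definition of a constant and the quotient criterion of Proposition 2 can be checked by brute force on a finite language. The following sketch is only an illustration under that finiteness assumption; the helper names contexts, is_constant and is_constant_by_quotients are hypothetical.

```python
from itertools import product


def contexts(w, language):
    """All pairs (u, v) with u + w + v in the finite language."""
    return {(x[:i], x[i + len(w):])
            for x in language
            for i in range(len(x) - len(w) + 1)
            if x[i:i + len(w)] == w}


def is_constant(w, language):
    """Definition: whenever uwv and swt are in L, uwt is in L as well."""
    ctx = contexts(w, language)
    # Ranging over all ordered pairs of contexts also covers swv by symmetry.
    return all(u + w + t in language
               for (u, _v), (_s, t) in product(ctx, repeat=2))


def is_constant_by_quotients(w, language):
    """Proposition 2: all prefixes uw of L have the same quotient uw\\L."""
    ctx = contexts(w, language)
    quotients = {frozenset(t for (s, t) in ctx if s == u) for (u, _v) in ctx}
    return len(quotients) <= 1
```

For L = {ab, aab} both tests agree that b is a constant and that a is not, since a·a·ab = aaab is not in L.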

Recall that a string x is a factor of a language L if there are w and y in
A* for which wxy is in L. We let Fac(L) = {x ∈ A* | x is a factor of L}. For
additional evidence of the appropriateness of the following definition see uni.

Definition 1. A language L is k-locally testable in the strict sense (k-LTS),
for a non-negative integer k, if every string of length k in A* (equivalently, in
Fac(L)) is a constant relative to L. We say that L is locally testable in the strict
sense (LTS) if it is k-LTS for some k.

Note: With this choice the family of LTS languages coincides with the family 
of null context splicing languages ( 0 , 0 and 0 ). 



We take the characterization of the concept of a k-reversible language given
by Angluin (Theorem 14 in [2]) as the basis of our definition, but we prefer to give
her definition in a slightly different form to allow us to emphasize what we believe
is a fundamental concept lying at the core of her definition (characterization).
This new concept is a generalization of Schützenberger's concept of a constant:
A string w in A* is a semiconstant relative to a language L if, whenever uwv,
swt, and uwt are in L, swv is also in L. This allows the following equivalent of
Angluin's definition:

Definition 2. A regular language L is k-reversible, for a non-negative integer
k, if every string of length k in A* (equivalently, in Fac(L)) is a semiconstant
relative to L. We say that L is reversible if it is k-reversible for some k.
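
The semiconstant condition can likewise be tested by brute force on a finite language. This is a minimal sketch under that assumption, with hypothetical helper names; it is meant only to make the difference from a constant concrete.

```python
def contexts(w, language):
    """All pairs (u, v) with u + w + v in the finite language."""
    return {(x[:i], x[i + len(w):])
            for x in language
            for i in range(len(x) - len(w) + 1)
            if x[i:i + len(w)] == w}


def is_semiconstant(w, language):
    """Whenever uwv, swt and uwt are in L, swv must be in L as well."""
    ctx = contexts(w, language)
    return all(s + w + v in language
               for (u, v) in ctx for (s, t) in ctx
               if u + w + t in language)
```

For L = {ac, bd} the null string is a semiconstant but not a constant (a·d is not in L), so every string of length 0 is a semiconstant while not every such string is a constant. For L = {ab, aab} the string a is not even a semiconstant.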

The versions of the definitions given here underscore the fact that the concept
of reversibility is a generalization of the LTS concept, in the same sense that
the concept of a semiconstant is a generalization of the concept of a constant.
Thus by our definition of the family k-LTS, each k-LTS language is also
k-reversible. For the definition of McNaughton and Papert (Cl) it is known that
each LTS language is reversible (0) and that each k-LTS language is
(k+1)-reversible (Lemma 13 in [15]).

4 A Broader Context for the LTS and the Reversible 
Families 

The following concept provides the basis for the embedding of the algorithmics 
of the LTS languages and the reversible languages into a common context. By 
an irreflexive class B we will mean a globally defined class of irreflexive binary
relations, one for each automaton, defined on the set of states of the automaton.
Recall that one such irreflexive class is provided by the empty relation and
another is provided by inequality. 

Definition 3. With each irreflexive class B we associate the family DB of regular
languages: A regular language R belongs to the family DB if, for the trimmed
minimal automaton of R with a state set S, the set Bad(B, R) = {w ∈ A* |
∃p, q ∈ S, pw is defined, qw is defined, and pw B qw} is finite.

Note: Since in the definition above we have required that the minimal automaton
is trimmed, replacing w ∈ A* by w ∈ Fac(R) does not change the meaning
of the definition.



Several notations are required for adequate discussion of examples. 

With each automaton M = (A, S, I, F, E) we associate several additional
automata and make fundamental use of the languages they recognize. The set
of all factors of the regular language R that is recognized by the automaton
M is the regular language Fac(R) = L((A, S, S, S, E)). With each state q in
S we associate the regular languages I(q) = L((A, S, {q}, F, E)) and F(q) =
L((A, S, S, {q}, E)). Thus I(q) consists of the strings that initiate at q and
terminate at a state in F, and F(q) consists of the strings that initiate anywhere
in S and terminate at q.

Note: Suppose that pBp holds for some state p of a trimmed minimal automaton
M. Then, L(M) belongs to DB only if I(p) is finite. This observation directs
our attention to the irreflexive class B in Definition 3.

Example 1. For each non-negative integer n, let B_n be the irreflexive class
globally specified in the state set of each trimmed automaton by:

B_n = {(p, q) | p and q are distinct and I(p) ∩ I(q) contains at least n strings}.
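
Membership of a pair of states in B_n can be decided by counting the strings common to I(p) and I(q). The following sketch is our own illustration, not taken from the original: it assumes a deterministic partial transition function given as a dictionary delta[(state, letter)] = state, and counts common strings length by length in the product of the automaton with itself. The length bound m(n + 1), with m the number of state pairs, is our own pumping estimate.

```python
def in_Bn(delta, finals, p, q, n):
    """Test (p, q) in B_n: p != q and I(p), I(q) share at least n strings."""
    if p == q:
        return False
    if n == 0:
        return True
    states = {s for (s, _a) in delta} | set(delta.values()) | set(finals)
    alphabet = {a for (_s, a) in delta}
    bound = len(states) ** 2 * (n + 1)   # lengths worth inspecting (assumed bound)
    paths = {(p, q): 1}                  # state pair -> number of words reaching it
    total = 0
    for _length in range(bound + 1):
        # Words of the current length accepted from both p and q.
        total += sum(c for (x, y), c in paths.items()
                     if x in finals and y in finals)
        if total >= n:
            return True
        step = {}
        for (x, y), c in paths.items():
            for a in alphabet:
                if (x, a) in delta and (y, a) in delta:
                    key = (delta[x, a], delta[y, a])
                    step[key] = step.get(key, 0) + c
        if not step:
            return False
        paths = step
    return False
```

As a usage example, take the automaton with edges 0 -c-> 1, 0 -d-> 2, 1 -a-> 1, 2 -a-> 2, 1 -b-> 3, 1 -e-> 3, 2 -b-> 3, 2 -f-> 3 and final state 3. The pair (1, 2) lies in every B_n because I(1) ∩ I(2) = a*b is infinite, whereas (0, 1) lies in B_0 but not in B_1.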

Note that B_0 is simply the set of all pairs of distinct states in the trimmed
automaton considered. The examples above are central to the significance of this 
article. The following observations have motivated this presentation. 

Proposition 3. The family DB_0 coincides with the family of LTS languages.

Proof. For a language L ∈ DB_0, M_{π_L,L} satisfies the condition of Def. 3. Thus,
Bad(B_0, L) is finite, and let k be the maximum length of strings in Bad(B_0, L).
(In case of Bad(B_0, L) being empty, we set k = −1.) Let w be any string of
length k + 1 and assume that uwv and swt are in L. Then, the states [uw]_{π_L}
and [sw]_{π_L} of M_{π_L,L} must be equivalent, since otherwise,
[uw]_{π_L} B_0 [sw]_{π_L} implies w ∈ Bad(B_0, L), a contradiction. Therefore, uwt
and swv are in L, which implies that every string of length k + 1 is a constant
relative to L. Thus, L ∈ LTS.

For a language L ∈ LTS, there exists a non-negative integer k such that
L is k-locally testable in the strict sense. Thus, every string of length k is a
constant relative to L. Consider the trimmed minimal automaton M_{π_L,L} of L.
Let w be any string of length greater than k. Note that w is a constant relative
to L. Then, w ∉ Bad(B_0, L) holds, since by Prop. 2, for any states [u]_{π_L} and
[s]_{π_L} of M_{π_L,L}, [uw]_{π_L} and [sw]_{π_L} must be equivalent, i.e.,
¬([uw]_{π_L} B_0 [sw]_{π_L}). Therefore, Bad(B_0, L) is finite, which implies
L ∈ DB_0. □

Proposition 4. The family DB_1 coincides with the family of reversible
languages.

Proof. Omitted. An argument similar to the proof of Proposition 3 can be applied. □

Our principal result for machine learning appears in Section 7 and confirms
that each DB_n shares the excellent learnability properties of DB_0 and DB_1.

Example 2. The following examples are given only to indicate the ease with
which additional natural examples of irreflexive classes can be generated. We
have not explored the significance of these examples.

(1) B_NC = {(p, q) | neither of I(p) and I(q) is contained in the other}.

(2) B_II = {(p, q) | I(p) and I(q) are distinct, but have infinite intersection}.

(3) B_FN = {(p, q) | precisely one of the states p and q is a final state}.

(4) B_NG = {(p, q) | precisely one of the languages I(p) and I(q) is
non-counting}.

In the remainder of this section we provide an algorithm for deciding, in a
uniform manner, whether a given regular language R is a DB language or not. By
definition, the question “R ∈ DB?” is equivalent to asking whether Bad(B, R) is
finite or not. It is easily verified that the computation of Bad(B, R) is carried
out in the following way:

Algorithm 1. For each regular language R and each irreflexive class B, decide
whether R lies in DB as follows:

Construct the two regular languages using a trimmed minimal automaton
of R:

    Good(B, R) = ∩{A* \ (F(p) ∩ F(q)) | pBq},

    Bad(B, R) = Fac(R) \ Good(B, R).

Decide if the regular language Bad(B, R) is finite.

R lies in DB if and only if Bad(B, R) is finite.
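
The finiteness test can also be carried out on the graph of state pairs, without building Good and Bad explicitly. The sketch below swaps in this pair-graph technique for the language-theoretic construction above and is only an illustration: it assumes a deterministic partial transition dictionary delta[(state, letter)] = state for the trimmed minimal automaton and receives the relation B as a set of state pairs. A string w is bad exactly when it labels a path from some pair (p, q) to a pair in B, so Bad(B, R) is infinite exactly when the pairs that can reach B contain a cycle.

```python
def bad_is_finite(delta, relation):
    """Decide whether Bad(B, R) is finite, given B as a set of state pairs."""
    states = {s for (s, _a) in delta} | set(delta.values())
    alphabet = {a for (_s, a) in delta}
    # Pair graph: (p, q) -a-> (pa, qa) whenever both transitions are defined.
    succ = {(p, q): {(delta[p, a], delta[q, a]) for a in alphabet
                     if (p, a) in delta and (q, a) in delta}
            for p in states for q in states}
    # Pairs from which some pair of B can be reached (B itself included).
    useful = set(relation)
    changed = True
    while changed:
        changed = False
        for pair, targets in succ.items():
            if pair not in useful and targets & useful:
                useful.add(pair)
                changed = True
    # Peel off pairs without a useful successor; what survives lies on a cycle
    # or leads into one, so arbitrarily long bad strings exist.
    remaining = set(useful)
    changed = True
    while changed:
        changed = False
        for pair in list(remaining):
            if not (succ[pair] & remaining):
                remaining.discard(pair)
                changed = True
    return not remaining
```

For the two-state automaton of (aa)* the relation B_0 = {(0, 1), (1, 0)} gives an infinite Bad set, while B_1 is empty there and gives an empty Bad set. For the automaton of a*b, with edges 0 -a-> 0 and 0 -b-> 1, the relation B_0 gives a finite Bad set.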

Note:

(1) Any word that is not a factor of R is vacuously ‘good’. Consequently we also
have Bad(B, R) = A* \ Good(B, R), i.e., Good and Bad partition A*.

(2) For every B and every R, A* Good(B, R) = Good(B, R), i.e., Good is always
a left ideal of A*. However, for the irreflexive classes given in Example 1
above, it is easily confirmed that A* Good(B_n, R) A* = Good(B_n, R), i.e., Good
is a two-sided ideal.

(3) For local testability in the strict sense, constants are ‘good’, others are ‘bad’.
For reversibility, semiconstants are ‘good’, others are ‘bad’. Constants have
proven to be of great value in the study of splicing systems well beyond their use
in generating LTS languages (fZ|). This suggests that semiconstants will prove
to be of value beyond the theory of reversible languages.



Algorithm 1 allows a non-negative integer, #B(R), to be associated with
each DB language R: If Bad(B, R) is empty, #B(R) = 0; otherwise #B(R) is
1 + the length of the longest string in Bad(B, R). This definition allows us to
say: For a DB language R all factors of length at least #B(R) are ‘good’.

Definition 4. Let k be a non-negative integer. A DB language R is called a k-DB
language if #B(R) ≤ k.

Theorem 1. If a regular language R is in DB it must be in k-DB for k = n²,
where n is the number of states in the trimmed minimal automaton of R.

Proof. Suppose that R is in DB but not in k-DB for k = n². Then there is a
pair of states p, q and a word w of length at least k for which pw B qw. Since
the length of w is at least n² and there are only n² ordered pairs of states, there
must be a factorization w = xyz, with y not null, such that the ordered pair
(px, qx) is identical with (pxy, qxy). It follows that, for every non-negative
integer i, p x y^i z B q x y^i z. This means that Bad(B, R) is not finite and
consequently that we have arrived at the contradiction: R is not in DB. □

Corollary 1. If a regular language R is LTS it must be n²-LTS where n is the
number of states in the trimmed minimal automaton of R. □

Corollary 2. If a regular language R is reversible it must be n²-reversible where
n is the number of states in the trimmed minimal automaton of R. □

Proposition 5. In case of #A ≥ 2, for any non-negative integers k, l, n,
k-DB_{n+1} − l-DB_n ≠ ∅ holds.

Proof. Let a, b be distinct elements in A. Let F_1 and F_2 be finite sets of strings
of length n + 1 such that #(F_1 ∩ F_2) = n and F_1 − F_2 ≠ ∅. Let v be any string
of length l and let L = avF_1 + bvF_2.

Then, it is straightforward to see that for any non-negative integer k,
L ∈ (k-DB_{n+1} − l-DB_n) holds. □

Thus, for any non-negative integers k and n, k-DB_n is a proper subset of
k-DB_{n+1}.

5 Characterizations of k-DB_n Languages

In this section, we give some characterizations of the k-DB_n families, where we
will introduce a generalized notion of a constant or a semiconstant.

Let ℱ_n be the set of all finite subsets of A* whose cardinality is equal to n.
The first characterization is given as follows:

Theorem 2. A regular language L is a k-DB_n language if and only if for any
strings u_1, u_2 ∈ A*, any string w of length k such that u_1 w, u_2 w ∈ Pre(L), and
any element F ∈ ℱ_n, u_1 w F ⊆ L and u_2 w F ⊆ L imply u_1 w \ L = u_2 w \ L.

Proof. Assume that L is a k-DB_n language. Then, the maximum length of strings
in Bad(B_n, L) is bounded by k − 1. Let u_1, u_2, w be any strings such that |w| = k,
and F be any element in ℱ_n. Suppose that u_1 w F ⊆ L and u_2 w F ⊆ L. Then, the
states [u_1 w]_{π_L} and [u_2 w]_{π_L} of the trimmed minimal automaton M_{π_L,L}
should be equivalent, since otherwise
[u_1]_{π_L} w = [u_1 w]_{π_L} B_n [u_2 w]_{π_L} = [u_2]_{π_L} w holds and
therefore w ∈ Bad(B_n, L), a contradiction. Therefore, we have
u_1 w \ L = u_2 w \ L.

Assume that the ‘if’ condition of the claim holds. Let w be any string of length
greater than or equal to k. Consider any distinct states [u_1]_{π_L} and [u_2]_{π_L}
of the minimal trimmed automaton M_{π_L,L}. In case that
I([u_1 w]_{π_L}) ∩ I([u_2 w]_{π_L}) contains at most n − 1 strings,
¬([u_1 w]_{π_L} B_n [u_2 w]_{π_L}) holds. In case that
I([u_1 w]_{π_L}) ∩ I([u_2 w]_{π_L}) contains some F ∈ ℱ_n, by the assumption, we
have u_1 w \ L = u_2 w \ L. Therefore, [u_1 w]_{π_L} = [u_2 w]_{π_L} holds, which
implies ¬([u_1 w]_{π_L} B_n [u_2 w]_{π_L}). Thus, in any case we have
¬([u_1 w]_{π_L} B_n [u_2 w]_{π_L}). Therefore, w ∉ Bad(B_n, L), i.e., L is
a k-DB_n language. □

For a non-negative integer n and a language L, a string w is an n-weak
constant relative to L if, whenever (1) uwv, swt ∈ L and (2) ∃F ∈ ℱ_n such that
uwF ⊆ L and swF ⊆ L, it holds that uwt and swv are also in L. The notion of
0-weak constant coincides with that of a constant. Further, the notion of 1-weak
constant coincides with that of a semiconstant. In this sense, the notion of an
n-weak constant is a natural hierarchical generalization of a constant.

The following proposition is straightforward: 

Proposition 6. A string w is an n-weak constant relative to a language L if and
only if for any strings u_1 w, u_2 w ∈ Pre(L) and any element F ∈ ℱ_n, u_1 w F ⊆ L
and u_2 w F ⊆ L imply u_1 w \ L = u_2 w \ L. □
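
For a finite language, the criterion of Proposition 6 gives a direct test: two prefixes ending in w admit a common F ∈ ℱ_n exactly when their quotients share at least n strings. The sketch below is an illustration under the finiteness assumption, and the function name is hypothetical.

```python
def is_n_weak_constant(w, language, n):
    """Proposition 6 on a finite language: quotients of prefixes ending in w
    that share at least n strings must coincide."""
    prefixes = {x[:i] for x in language for i in range(len(x) + 1)}
    quotients = [frozenset(x[len(p):] for x in language if x.startswith(p))
                 for p in prefixes if p.endswith(w)]
    return all(q1 == q2 or len(q1 & q2) < n
               for q1 in quotients for q2 in quotients)
```

With L = {ab, aab}, the prefixes a and aa have quotients {b, ab} and {b}, which differ but share one string. Hence a is neither a 0-weak nor a 1-weak constant, but it is a 2-weak constant.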

Theorem 3. A regular language L is a k-DB_n language, for a non-negative
integer k, if and only if every string of length k in A* (equivalently, in Fac(L)) is
an n-weak constant relative to L.

Proof. By Th. 2 and Prop. 6. □

We introduce a class of automata, which can be used to characterize
k-DB_n languages. An automaton M is called a k-DB_n automaton if it is
trimmed, deterministic and for any states p, q of M and any string w of length
k, ¬(pw B_n qw) holds. The class of k-DB_1 automata coincides with the class of
k-reversible automata ([3]). The following proposition is immediate from Def. 3
and Def. 4.

Proposition 7. A language L is a k-DB_n language if and only if the trimmed
minimal automaton of L is a k-DB_n automaton. □

It is known that L is k-reversible if and only if L is accepted by some
k-reversible automaton. The next characterization gives the generalization of these
properties.

Theorem 4. A language L is a k-DB_n language if and only if L is recognized
by some k-DB_n automaton.

Proof. Prop. 7 gives the proof of the ‘only if’ direction.

Assume that there exists a k-DB_n automaton M = (A, S, {p_0}, F, E) recognizing
L. Let π_M be a right congruence defined by s π_M t if and only if p_0 s = p_0 t.
Then, it is easily verified that M_{π_M,L} is equivalent to M itself by allowing the
renaming of the states. Recall that π_M is finer than π_L since M recognizes L.

Let u_1, u_2, w be any strings and F be any element of ℱ_n such that |w| = k,
u_1 w F ⊆ L and u_2 w F ⊆ L hold. Then, since M_{π_M,L} is a k-DB_n automaton,
for the states [u_1]_{π_M} and [u_2]_{π_M} of M_{π_M,L}, we have
¬([u_1]_{π_M} w B_n [u_2]_{π_M} w). Therefore, we have
¬([u_1 w]_{π_M} B_n [u_2 w]_{π_M}), which implies, by u_1 w F ⊆ L and
u_2 w F ⊆ L, [u_1 w]_{π_M} = [u_2 w]_{π_M}. Thus, we have u_1 w π_M u_2 w. Since π_M is
finer than π_L, we also have u_1 w π_L u_2 w. This implies, by Th. 2, that L is a
k-DB_n language. □

With this theorem, we can conclude that the class of k-DB_1 languages
coincides with the class of k-reversible languages, since the class of k-DB_1 automata
is equivalent to the class of k-reversible automata.

6 k-DB_n Approximation of Regular Languages

For a language L and a string w ∈ Pre(L), we will select a string w' such that
ww' ∈ L, and denote it by tail(w).

Let k and n be any non-negative integers.

Define the sets:

    T(R, k) = {(s_1, t, s_2) | s_1 t s_2 → s_1 s_2 ∈ Pre(R) ∧
               s_1 s_2 is minimal with respect to π_{R,k}},
    A(R, n, k) = {s_1 t^i s_2 tail(s_1 s_2) | (s_1, t, s_2) ∈ T(R, k), 0 ≤ i ≤ n}.

Note: For an element s_1 t^i s_2 tail(s_1 s_2) ∈ A(R, n, k), there is a reduction:
s_1 t^i s_2 tail(s_1 s_2) → s_1 t^{i−1} s_2 tail(s_1 s_2) → ⋯ → s_1 s_2 tail(s_1 s_2) ∈ R.
Since π_{R,k} is finer than π_R, every element of A(R, n, k) is also in R. Therefore, we
have A(R, n, k) ⊆ R.

Lemma 1. Let R be any language. Let L be any k-DB_n language containing
A(R, n, k). For any strings u_1, u_2 ∈ Pre(R), u_1 π̂_{R,k} u_2 implies u_1 π_L u_2.

Proof. Let u be any string in Pre(R) and w be a minimal string for u with
respect to π_{R,k}. Assume that i (i ≥ 0) is the number of steps of the left-most
reductions from u to w. We will prove u \ L = w \ L by induction on i.

In case of i = 0, u = w, which immediately implies the claim.

Assume that the claim holds for i ≤ l, and consider the case of i = l + 1. Let
s_1 t a s_2 → s_1 s_2 = w be the last left-most π_{R,k}-reduction when u is left-most
π_{R,k}-reduced to w, where s_1, s_2, t ∈ A* and a ∈ A. Then, by the definition of the
left-most reduction, there exists some u' ∈ A* such that u = u' a s_2 and
u' →* s_1 t, where s_1 t is minimal with respect to π_{R,k}. By the induction
hypothesis, u' \ L = s_1 t \ L holds. Since π̂_{R,k} is finer than π_{R,k}, by
s_1 t a s_2 → s_1 s_2, we have s_1 t a π_{R,k} s_1. Then, by the definition of
A(R, n, k), both s_1 t a \ L and s_1 \ L contain the set
{(ta)^j s_2 tail(s_1 s_2) | 0 ≤ j ≤ n − 1}. Further, by s_1 t a π_{R,k} s_1, we
have suf_k(s_1 t a) = suf_k(s_1). Therefore, by the k-DB_n property of L (Th. 2), we
have s_1 t a \ L = s_1 \ L. Thus, we have
u \ L = u' a s_2 \ L = s_1 t a s_2 \ L = s_1 s_2 \ L = w \ L, completing the
induction step.

For the strings u_1, u_2 ∈ Pre(R), by u_1 π̂_{R,k} u_2, we have u_1 \ L = u_2 \ L. This
completes the proof. □



Corollary 3. Let R be any regular language. Let L be any k-DB_n language
containing R. For any strings u_1, u_2 ∈ Pre(R), u_1 π̂_{R,k} u_2 implies u_1 π_L u_2.

Proof. By Lem. 1 and A(R, n, k) ⊆ R. □

Algorithm 2. Given a regular language R and a non-negative integer k, construct
the minimal k-DB_n language containing R as follows:

Construct M_{π̂_{R,k},R}.

Assign i the value 0. Assign M(i) the value M_{π̂_{R,k},R}.

[Label]

IF there exists no pair of distinct states p_1 and p_2 in the state set of M(i)
satisfying either of the following conditions (a) or (b), then halt.

ELSE, construct M(i+1) from M(i) by merging states p_1 and p_2 in M(i)
into a new state q:

(a) p_1 B_n p_2 holds and F(p_1) ∩ F(p_2) contains a k-length string,

(b) there exists a ∈ A and a state r in M(i) such that p_1 ∈ ra and p_2 ∈ ra.

Assign i the value i + 1.

Go to [Label].
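
The merging loop can be illustrated in the simplest case n = 0, where B_0 relates every pair of distinct states and condition (a) only asks for a common k-length string. The sketch below rests on assumptions of our own: the input is a finite sample, and the loop starts from the prefix tree of the sample rather than from the automaton constructed in the first step of Algorithm 2. All function names are hypothetical.

```python
def k_length_reach_words(states, edges, k):
    """Map each state p to the k-length strings in F(p)."""
    words = {s: {""} for s in states}
    for _ in range(k):
        nxt = {s: set() for s in states}
        for (s, a, t) in edges:
            nxt[t] |= {w + a for w in words[s]}
        words = nxt
    return words


def find_mergeable_pair(states, edges, k):
    targets = {}
    for (s, a, t) in sorted(edges):
        other = targets.setdefault((s, a), t)
        if other != t:
            return other, t            # condition (b): both lie in sa
    words = k_length_reach_words(states, edges, k)
    ordered = sorted(states)
    for i, p1 in enumerate(ordered):
        for p2 in ordered[i + 1:]:
            if words[p1] & words[p2]:
                return p1, p2          # condition (a) specialised to n = 0
    return None


def merge_k_lts(sample, k):
    """Merge states of the prefix tree of a finite sample until (a), (b) fail."""
    states = {x[:i] for x in sample for i in range(len(x) + 1)} | {""}
    edges = {(x[:i], x[i], x[:i + 1]) for x in sample for i in range(len(x))}
    initial, finals = "", set(sample)
    while (pair := find_mergeable_pair(states, edges, k)) is not None:
        keep, drop = pair
        rename = lambda s: keep if s == drop else s
        states = {rename(s) for s in states}
        edges = {(rename(s), a, rename(t)) for (s, a, t) in edges}
        initial, finals = rename(initial), {rename(s) for s in finals}
    return initial, finals, edges


def accepts(automaton, word):
    initial, finals, edges = automaton
    step = {(s, a): t for (s, a, t) in edges}   # deterministic once (b) fails
    state = initial
    for a in word:
        if (state, a) not in step:
            return False
        state = step[state, a]
    return state in finals
```

For the sample {ab, aab} and k = 1, two merges take place and the resulting automaton accepts ab, aab, aaab and so on, but neither a nor b.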



Theorem 5. Algorithm 2 outputs an automaton recognizing the minimal
k-DB_n language containing the given regular language R.

Proof. Note that any language recognized by an automaton which has only one
state could be a k-DB_n language. Therefore, the algorithm terminates because
the number of states of M(0) is an upper bound for the number of times the loop
can be traversed. Further, the output of the algorithm is a k-DB_n automaton,
since it has no pair of states satisfying conditions (a) or (b).

Let L be any k-DB_n language containing R and M_{π_L,L} = (A, S, {p_0}, F, E)
be the minimal trimmed automaton of L. By Prop. 1, it suffices to show that for
every i ≥ 0, M(i) ≤ M_{π_L,L} holds. We will prove this claim by induction on i.

In case of i = 0, define θ_0 as θ_0([w]_{π̂_{R,k}}) = [w]_{π_L}. By Cor. 3, this
mapping is well-defined and it is straightforward to see M(0) ≤ M_{π_L,L}.

Assume that M(l) ≤ M_{π_L,L} and let θ_l be a mapping from S_l to S ensuring
the relation M(l) ≤ M_{π_L,L}, where S_l is the state set of M(l). M(l + 1) will be
constructed by merging states p_1 and p_2 satisfying the condition (a) or (b). Let
p'_1 = θ_l(p_1) and p'_2 = θ_l(p_2). In case of (a), by the induction hypothesis and
by the fact that M_{π_L,L} is a k-DB_n automaton, p'_1 = p'_2 holds, since otherwise
F(p'_1) ∩ F(p'_2) contains a k-length string and p'_1 B_n p'_2 holds, which contradicts
the definition of k-DB_n automaton. In case of (b), by the induction hypothesis
and by the determinism of M_{π_L,L}, p'_1 = p'_2 holds. Thus, in both cases, we have
p'_1 = p'_2.

Then, we will construct a new mapping θ_{l+1} from S_{l+1} to S, where S_{l+1} is the
state set of M(l + 1) written as (S_l − {p_1, p_2}) ∪ {q}, defined by:
θ_{l+1}(p) = θ_l(p_1) if p = q, otherwise θ_{l+1}(p) = θ_l(p).

By p'_1 = p'_2, θ_{l+1} ensures the relation M(l + 1) ≤ M_{π_L,L}, which completes the
induction step. □

Although a minimal k-DB language containing an arbitrary regular language
R exists for each non-negative integer k, there may be no minimal DB language
containing R. This fundamental fact is illustrated in the following:

Example 3. For the language L = ca*(b + e) + da*(b + f), one can confirm that,
for each non-negative integer k, the minimal k-DB_1 language containing L is the
language L_k = {ca^i(b + e) | 0 ≤ i ≤ k − 1} ∪ {da^i(b + f) | 0 ≤ i ≤ k − 1} ∪
{(c + d)a^i(b + e + f) | i ≥ k}. Consequently, the sequence {L_k | k = 0, 1, 2, ...}
is a strictly descending infinite nest of languages. Thus L is not contained in a
minimal DB_1 language. It is interesting to note that the intersection of this nest
is precisely L, a non-DB_1 language.



7 Learnability Results for the Families k-DB_n

For a language L over A, a positive presentation of L is an infinite sequence
σ = w_1, w_2, ... such that {w_i ∈ A* | i ≥ 1} = L.

Let C be a family of languages over A. Consider a class of representations ℛ
for C with the following properties:

1. ℛ is a recursively enumerable language (over some fixed alphabet).

2. For every L ∈ C, there exists an r ∈ ℛ such that r represents L (denoted by
   L(r) = L).

3. There exists a recursive function f such that for all r ∈ ℛ and w ∈ A*,

       f(r, w) = 1 if w ∈ L(r), and f(r, w) = 0 otherwise.

Thus, L(r) is the language represented by r. Note that {L(r) | r ∈ ℛ} could be
regarded as an indexed family of recursive languages (cf. Q), if we encode each
r ∈ ℛ into a positive integer.

Let C_1 and C_2 be families of languages, and ℛ be a class of representations for
C_2. Let L be any language in C_1. We say that an algorithm M upper approximately
identifies L in the limit from positive data using C_2 and ℛ if for any positive
presentation of L, the infinite sequence g_1, g_2, g_3, ... (g_i ∈ ℛ) produced by M
converges to an element r ∈ ℛ such that L(r) is a minimal language in C_2
containing L. A family C_1 is upper approximately identifiable in the limit from
positive data using C_2 and ℛ if there exists an algorithm M such that M
upper approximately identifies every concept in C_1 in the limit from positive data
using C_2 and ℛ.

Note: In case of C_1 = C_2, we simply say that C_1 is identifiable in the limit from
positive data using ℛ. This coincides with the original definition by [2] and



Example 4. For the families DB_n or k-DB_n, we can use the representation class
of finite automata. In the sequel, we will use this finite automata representation
for these families of languages, and ℛ is omitted if the context allows.

The following notion of characteristic sample, introduced by [2], plays an
essential role in the theory of learning from positive data.

Let C be a family of languages. For any language L over A, a finite subset F
of L is called a characteristic sample of L with respect to C if for any X ∈ C,
F ⊆ X implies L ⊆ X.

Theorem 6. Let R be a regular language, and let n and k be non-negative
integers. Then, there exists a characteristic sample of R with respect to the
family of k-DB_n languages.

Proof. Since R is regular, the number m of equivalence classes of π_{R,k} is finite.
Therefore, the number of elements of T(R, k) is finite, since for each
(s_1, t, s_2) ∈ T(R, k), the lengths of s_1, t and s_2 are bounded by m. Thus,
A(R, n, k) is also finite. We will prove that this finite set A(R, n, k) is a
characteristic sample of R with respect to the family of k-DB_n languages.

Let L be any k-DB_n language containing A(R, n, k) and let
M_{π_L,L} = (A, S_1, {p_0}, F_1, E_1). Consider an automaton
M_{π̂_{R,k},R} = (A, S_2, I_2, F_2, E_2) which recognizes R. Define a mapping θ from
S_2 to S_1 such that for each state [w]_{π̂_{R,k}} ∈ S_2, θ([w]_{π̂_{R,k}}) = [w]_{π_L}
holds. (Note that we can take w from the set Pre(A(R, n, k)).) Then, by Lem. 1
and by the fact that every final state of M_{π̂_{R,k},R} is accessible from the
initial state by the transition using some string in A(R, n, k), it is
straightforward to see that the mapping θ ensures the relation
M_{π̂_{R,k},R} ≤ M_{π_L,L}. Therefore, by Prop. 1, we have R ⊆ L. This completes
the proof. □



Theorem 7. The family of regular languages is upper approximately
identifiable in the limit from positive data using each of the families k-DB_n.

Proof. The learning device M for this approximate learning is given as follows:
For a new input w_i, M outputs the previous conjecture g_{i−1} if L(g_{i−1}) contains
w_i; otherwise M constructs a trimmed minimal automaton M' accepting
precisely the union of {w_i} and the previous inputs {w_1, ..., w_{i−1}} and applies
Algorithm 2. The output of Algorithm 2 is the new conjecture at the stage i of M.

It holds that at each stage i (i ≥ 1), the output M_i of M represents a minimal
language containing {w_1, ..., w_i}. By Theorem 6 there exists some stage j (j ≥ 1)
such that {w_1, ..., w_j} contains a characteristic sample F of the target regular
language L with respect to the family of k-DB_n languages. At the stage j, M
outputs a conjecture representing a minimal k-DB_n language containing L, since
L(M_j) is a minimal k-DB_n language containing {w_1, ..., w_j} (⊇ F) and thus for
any k-DB_n language L' containing L, F ⊆ {w_1, ..., w_j} ⊆ L ⊆ L(M_j) ⊆ L' holds.
Therefore, after the stage j, M does not change its conjecture. This completes
the proof. □

Note: The learning algorithm of Th. 7 has several good
properties such as consistency, conservativeness, responsiveness. (See 0 .) 

8 Problems 

1. Consider some (all?) significant families of regular languages for possible 
representation by means of irreflexive classes. 

2. Generate new families of regular languages by specifying potentially
interesting irreflexive classes $\mathcal{B}$ and investigate the language families $\mathcal{DB}$ they
determine for potential significance.

3. For each family $\mathcal{DB}$: (i) Are there additional algorithms for treating the
languages in $\mathcal{DB}$? (ii) Are there new properties of these languages that can be
discovered based on the new definition? (iii) Are there previously known
properties that can be demonstrated in a simpler way, or with greater generality
so that the property is seen to hold for additional $\mathcal{DB}$ families?

4. Characterize by abstract properties the families of regular languages that 
admit definitions by irreflexive classes as discussed here. This may be possi- 
ble using a reformulation of our method in which one forms, for each regular 
language R, the product automaton M x M, where M is the minimal au- 
tomaton recognizing R. The irreflexive relation then becomes a set of states 
of M X M that is disjoint from the diagonal. Special attention may be given 
to those irreflexive classes $\mathcal{B}$ for which Good($\mathcal{B}$, $R$) is a two-sided ideal for
each regular language R. 

5. Whether a language is regular or not, it has a unique minimal automaton, 
although the number of states will be infinite if the language is not regular. 
Are there tractable families of non-regular languages that can be defined in 
the present manner using (trimmed?) minimal automata having an infinite 
number of states? 

6. What role might the semiconstant strings relative to a language play beyond 
the context of the present investigation? 






T. Head, S. Kobayashi, and T. Yokomori 



Acknowledgement 

We are grateful to anonymous referees for their valuable comments. The first 
author is grateful to M. Hagiya and the Molecular Computing Project members 
for inviting him to Japan. This research was supported in part by the "Research
for the Future" Program No. JSPS-RFTF 96100101 from the Japan Society for
the Promotion of Science and by the National Science Foundation of the United 
States through grant CCR-9509831. 

References 

1. D. Angluin, Inductive inference of formal languages from positive data, Information 
and Control 45 (1980) 117-135. 

2. D. Angluin, Inference of reversible languages. Journal of the ACM 29 (1982) 741- 
765. 

3. E. W. Dijkstra, A note on two problems in connexion with graphs, Numerische
Mathematik, 1 (1959) 269-271.

4. E. Mark Gold, Language identification in the limit, Information and Control, 10 
(1967), 447-474. 

5. T. Head, Formal language theory and DNA: an analysis of the generative capacity
of specific recombinant behaviors. Bulletin of Mathematical Biology, 49 (1987)
737-759.

6. T. Head, Splicing representations of strictly locally testable languages, submitted 
for publication. (1997) 

7. T. Head, Splicing languages generated with one-sided context, submitted for
publication. (1997)

8. S. Kobayashi and T. Yokomori, Families of non-counting languages and their learn- 
ability from positive data. Intern. Journal of Foundations of Computer Science, 7 
309-327 (1996). 

9. S. Kobayashi and T. Yokomori, Learning approximately regular languages with
reversible languages, Theoretical Computer Science, 174 (1997) 251-257.

10. A. De Luca and A. Restivo, A characterization of strictly locally testable languages
and its application to subsemigroups of a free semigroup. Information and Control,
44 (1980) 300-319.

11. R. McNaughton and S. Papert, Counter-Free Automata, MIT Press, Cambridge, 
Massachusetts (1971). 

12. M. P. Schützenberger, Sur certaines opérations de fermeture dans les langages
rationnels. Symposium Mathematicum, 15 (1975) 245-253.

13. T. Yokomori, N. Ishida, and S. Kobayashi, Learning local languages and its
application to protein alpha-chain identification. In Proc. of 27th Hawaii Intern. Conf.
on System Sciences, IEEE Press, 113-122 (1994).

14. T. Yokomori, On polynomial-time learnability in the limit of strictly deterministic 
automata. Machine Learning, 19 (1995) 153-179. 

15. T. Yokomori and S. Kobayashi, Learning local languages and its application to 
DNA sequence analysis, IEEE Transactions on Pattern Analysis and Machine In- 
telligence, to appear. 







Synthesizing Learners Tolerating Computable 

Noisy Data 



John Case¹ and Sanjay Jain²

¹ Department of CIS
University of Delaware
Newark, DE 19716, USA
Email: case@cis.udel.edu
² Sanjay Jain
Department of ISCS
National University of Singapore
Singapore 119260
Email: sanjay@iscs.nus.edu.sg



Abstract. An index for an r.e. class of languages (by definition) gene- 
rates a sequence of grammars defining the class. An index for an indexed 
family of languages (by definition) generates a sequence of decision pro- 
cedures defining the family. 

F. Stephan’s model of noisy data is employed, in which, roughly, correct 
data crops up infinitely often, and incorrect data only finitely often. 

In a completely computable universe, all data sequences, even noisy ones, 
are computable. New to the present paper is the restriction that noisy 
data sequences be, nonetheless, computable! 

Studied, then, is the synthesis from indices for r.e. classes and for indexed 
families of languages of various kinds of noise-tolerant language-learners 
for the corresponding classes or families indexed, where the noisy input 
data sequences are restricted to being computable. 

Many positive results, as well as some negative results, are presented 
regarding the existence of such synthesizers. 

The main positive result is surprisingly more positive than its analog 
in the case the noisy data is not restricted to being computable: gram- 
mars for each indexed family can be learned behaviorally correctly from 
computable, noisy, positive data! The proof of another positive synthe- 
sis result yields, as a pleasant corollary, a strict subset-principle or tell- 
tale style characterization, for the computable noise-tolerant behaviorally 
correct learnability of grammars from positive and negative data, of the 
corresponding families indexed. 



1 Introduction 

Ex-learners, when successful on an object input, (by definition) find a final
correct program for that object after at most finitely many trial and error attempts

K!ol({7IRR7,»ii;Ms,JI(!bs'ii n 

¹ Ex is short for explanatory.



M. Richter et al. (Eds.): ALT’98, LNAI 1501, pp. 205 ff., 1998.
Springer-Verlag Berlin Heidelberg 1998

For function learning, there is a learner-synthesizer algorithm Isyn so that, if
Isyn is fed any procedure that lists programs for some (possibly infinite) class S 
of (total) functions, then Isyn outputs an Ex-learner successful on S [Gol67]. The
learners so synthesized are called enumeration techniques lfifi 7 niFulil()l . These 
enumeration techniques yield many positive learnability results, for example, 
that the class of all functions computable in time polynomial in the length of 
input is Ex-learnable.²

For language learning from positive data and with learners outputting
grammars, [OSW88] provided an amazingly negative result: there is no
learner-synthesizer algorithm Isyn so that, if Isyn is fed a pair of grammars $g_1, g_2$
for a language class $\mathcal{L} = \{L_1, L_2\}$, then Isyn outputs an Ex-learner successful,
from positive data, on $\mathcal{L}$.³ [BCJ96] showed how to circumvent some of the sting
of this [OSW88] result by resorting to more general learners than Ex. Examples of
more general learners are Bc-learners, which, when successful on an object
input, (by definition) find a final (possibly infinite) sequence of correct programs
for that object after at most finitely many trial and error attempts |B7klS88| FI 
Of course, if a suitable learner-synthesizer algorithm Isyn is fed procedures for
listing decision procedures (instead of mere grammars), one also has more success
at synthesizing learners. In fact the computational learning theory community
has shown considerable interest (spanning at least from [Gol67] to [ZL95]) in
language classes defined by r.e. listings of decision procedures. These classes are
called uniformly decidable or indexed families. As is essentially pointed out in
[Ang80], all of the formal language style example classes are indexed families. A
sample result from [BCJ96] is: there is a learner-synthesizer algorithm Isyn so
that, if Isyn is fed any procedure that lists decision procedures defining some
indexed family $\mathcal{L}$ of languages which can be Bc-learned from positive data with the
learner outputting grammars, then Isyn outputs a Bc-learner successful, from
positive data, on C. The proof of this positive result yielded the surprising cha- 
racterization EM]: for indexed families C, C can be Bc-learned from positive 
data with the learner outputting grammars iff 

$(\forall L \in \mathcal{L})(\exists S \subseteq L \mid S \text{ is finite})(\forall L' \in \mathcal{L} \mid S \subseteq L')[L' \not\subset L]$. (1)

(1) is Angluin’s important Condition 2 from [Ang80], and it is referred to as the
subset principle, in general a necessary condition for preventing overgeneraliza- 
tion in learning from positive data |Ang8bfhier8niXbkhRlk hih'iK !a~^ . 
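
To make the subset principle concrete, the following minimal sketch tests whether a given finite set S witnesses condition (1) for one language L of a finite family. It assumes languages are finite Python frozensets of natural numbers; the name `is_telltale` and the example family are hypothetical and not taken from the paper.

```python
from typing import FrozenSet, Iterable

Language = FrozenSet[int]


def is_telltale(
    candidate: Iterable[int], language: Language, family: Iterable[Language]
) -> bool:
    """Return True iff candidate is a finite subset S of L such that no member
    L' of the family satisfies S <= L' and L' is a proper subset of L."""
    s = frozenset(candidate)
    if not s <= language:
        return False
    # 'other < language' is the proper-subset test on frozensets.
    return not any(s <= other and other < language for other in family)


# Hypothetical toy family used only for illustration.
FAMILY = [frozenset({1}), frozenset({1, 2})]
```

For `L = {1, 2}`, the set `{1}` is not a suitable witness, since the smaller language `{1}` also contains it; the set `{2}` is. A learner that sees only the element 1 would overgeneralize by conjecturing `{1, 2}`.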

² The reader is referred to Jantke [Jan79a, Jan79b] for a discussion of synthesizing
learners for classes of computable functions that are not necessarily recursively
enumerable.

³ Again for language learning from positive data and with learners outputting
grammars, a somewhat related negative result is provided by Kapur [Kap91]. He shows
that one cannot algorithmically find an Ex-learning machine for Ex-learnable
indexed families of recursive languages from an index of the class. This is a bit weaker
than a closely related negative result from [BCJ96].

⁴ Bc is short for behaviorally correct.

[CJS98a] considered language learning from both noisy texts (only positive
data) and from noisy informants (both positive and negative data), and
adopted, as does the present paper, Stephan’s [Ste95, CJS98b] noise model. Roughly,
in this model correct information about an object occurs infinitely often while
incorrect information occurs only finitely often. Hence, this model has the
advantage that noisy data about an object nonetheless uniquely specifies that object.⁵
In the context of [CJS98a], where the noisy data sequences can be uncomputable,
the presence of noise plays havoc with the learnability of many concrete
classes that can be learned without noise. For example, the well-known class of
pattern languages [Ang80]⁶ can be Ex-learned from texts but cannot be
Bc-learned from unrestricted noisy texts even if we allow the final grammars each
to make finitely many mistakes. While it is possible to Ex-learn the pattern
languages from informants in the presence of noise, a mind-change complexity
price must be paid: any Ex-learner succeeding on the pattern languages from
unrestricted noisy informant must change its mind an unbounded finite number
of times about the final grammar. However, some learner can succeed on the
pattern languages from noise-free informants and on its first guess as to a
correct grammar (see [LZK96]). The class of languages formed by taking the union
of two pattern languages can be Ex-learned from texts EM; however, this 
class cannot be Bc-learned from unrestricted noisy informants even if we allow 
the final grammars each to make finitely many mistakes. 

In [CJS98a], the proofs of most of the positive results providing existence of
learner-synthesizers which synthesize noise-tolerant learners also yielded
pleasant characterizations which look like strict versions of the subset principle (1).⁷
Here is an example. If $\mathcal{L}$ is an indexed family, then: $\mathcal{L}$ can be noise-tolerantly
Ex-learned from positive data with the learner outputting grammars (iff $\mathcal{L}$ can


⁵ Less roughly: in the case of noisy informant each false item may occur a finite number
of times; in the case of text, it is mathematically more interesting to require, as we 
do, that the total amount of false information has to be finite. The alternative of 
allowing each false item in a text to occur finitely often is too restrictive; it would, 
then, be impossible to learn even the class of all singleton sets [Ste95] (see also
Theorem 0) . 

⁶ [Nix83] as well as [SA95] outline interesting applications of pattern inference
algorithms. For example, pattern language learning algorithms have been successfully
applied for solving problems in molecular biology (see [SSS+94, SA95]). Pattern
languages and finite unions of pattern languages [Shi83, Wri89] turn out to be subclasses
of Smullyan’s [Smu61] Elementary Formal Systems (EFSs). show that the

EFSs can also be treated as a logic programming language over strings. The techni- 
ques for learning finite unions of pattern languages have been extended to show the 
learnability of various subclasses of EFSs EEHU- Investigations of the learnability 
of subclasses of EFSs are important because they yield corresponding results about 
the learnability of subclasses of logic programs. use the insight gained from 

the learnability of EFSs subclasses to show that a class of linearly covering logic 
programs with local variables is TxtEx-learnable. These results have consequences 
for Inductive Logic Programming [M K94IU )94| . 

⁷ For $\mathcal{L}$ either an indexed family or defined by some r.e. listing of grammars, the
prior literature has many interesting characterizations of $\mathcal{L}$ being Ex-learnable
from noise-free positive data, with and without extra restrictions. See, for
example, [Ang80, Muk92, LZK96, dJK96].

be noise-tolerantly Bc-learned from positive data with the learner outputting
grammars) iff

$(\forall L, L' \in \mathcal{L})[L \subseteq L' \Rightarrow L = L']$. (2)

(2) is easily checkable (as is (1) above, but (2) is more restrictive, as we saw in
the just previous paragraph).
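
Condition (2) says that no two distinct members of the family are comparable under inclusion. The following minimal sketch, assuming a finite family of finite languages represented as frozensets (the name `satisfies_condition_2` is hypothetical), checks this directly.

```python
from typing import FrozenSet, Sequence

Language = FrozenSet[int]


def satisfies_condition_2(family: Sequence[Language]) -> bool:
    """True iff L <= L' implies L == L' for all L, L' in the family."""
    return all(a == b or not a <= b for a in family for b in family)
```

For instance, the family consisting of `{1}` and `{2}` passes the check, while the family consisting of `{1}` and `{1, 2}` fails it, although each of its members has a finite witness in the sense of (1).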

In a completely computable universe, all data sequences, even noisy ones, 
are computable. In the present paper, we are concerned with learner-synthesizer 
algorithms which operate on procedures which list either grammars or decision 
procedures but, significantly, we restrict the noisy data sequences to being com- 
putable. 

Herein, our main and surprising result (Theorem 7 in Section 4.1 below) is:
there is a learner-synthesizer algorithm Isyn so that, if Isyn is fed any procedure
that lists decision procedures defining any indexed family $\mathcal{L}$ of languages, then
Isyn outputs a learner which, from computable, noisy, positive data on any $L \in \mathcal{L}$,
outputs a sequence of grammars eventually all correct for $L$! This result has the
following corollary (stated in Section 4.1 below): for every indexed family
$\mathcal{L}$, there is a machine for Bc-learning $\mathcal{L}$, where the machine outputs grammars
and the input is computable noisy positive data! Essentially, Theorem 7 is a
constructive version of this corollary: not only can each indexed family be
Bc-learned (outputting grammars on computable noisy positive data), but one can
algorithmically find a corresponding Bc-learner (of this kind) from an index for
any indexed family! As a corollary to Theorem 7 we have that the class of finite
unions of pattern languages is Bc-learnable from computable noisy texts, where
the machine outputs grammars (this contrasts sharply with the negative result
mentioned above from [CJS98a] that even the class of pattern languages is not
learnable from unrestricted noisy texts)!

Another main positive result of the present paper is a corollary in Section 4.1
below. It says that an indexed family $\mathcal{L}$ can be Bc-learned from computable noisy
informant data by outputting grammars iff

$(\forall L \in \mathcal{L})(\exists z)(\forall L' \in \mathcal{L} \mid \{x \leq z \mid x \in L\} = \{x \leq z \mid x \in L'\})[L' \subseteq L]$. (3)

A companion corollary in the same section is the constructive version and says
one can algorithmically find such a learner from an index for any indexed family
so learnable. (3) is easy to check too and intriguingly differs slightly from the
characterization in [CJS98a] of the same learning criterion applied to indexed
families but with the noisy data sequences unrestricted:

$(\forall L \in \mathcal{L})(\exists z)(\forall L' \in \mathcal{L} \mid \{x \leq z \mid x \in L\} = \{x \leq z \mid x \in L'\})[L' = L]$. (4)

Let $N$ denote the set of natural numbers. Then $\{L \mid \mathrm{card}(N - L) \text{ is finite}\}$
satisfies (3), but not (4)! However, $\mathcal{L}$ = the class of all unions of two pattern
languages satisfies neither (3) nor (4).

As might be expected, for several learning criteria considered here and in
previous papers on synthesis, the restriction to computable noisy data sequences
may, in some cases, reduce a criterion to one previously studied, but, in other
cases (e.g., the one mentioned at the end of the just previous paragraph), not.
Section 3 below, then, contains many of the comparisons of the criteria of this
paper to those of previous papers.

As we indicated above, Section 4.1 below contains the main results of the present
paper, and, in general, the results of this section are about synthesis from
indices for indexed families and, when appropriate, corresponding characteriza- 
tions. Section lO below contains our positive and negative results on synthesis 
from r.e. indices for r.e. classes. 

Finally Section 0 gives some directions for further research. 

2 Preliminaries 

2.1 Notation and Identification Criteria 

The recursion theoretic notions are from the books of Odifreddi and
Soare [Soa87]. $N = \{0, 1, 2, \ldots\}$ is the set of all natural numbers, and this paper
considers r.e. subsets $L$ of $N$. $N^+ = \{1, 2, 3, \ldots\}$, the set of all positive integers.
All conventions regarding range of variables apply, with or without decoration⁸,
unless otherwise specified. We let $c, e, i, j, k, l, m, n, q, s, t, u, v, w, x, y, z$
range over $N$. $\emptyset, \in, \subseteq, \supseteq, \subset, \supset$ denote empty set, member of, subset, superset,
proper subset, and proper superset respectively. max(), min(), card() denote the
maximum, minimum, and cardinality of a set respectively, where by convention
$\max(\emptyset) = 0$ and $\min(\emptyset) = \infty$. $\mathrm{card}(S) \leq *$ means cardinality of set $S$ is finite.
$a, b$ range over $N \cup \{*\}$. $\langle \cdot, \cdot \rangle$ stands for an arbitrary but fixed, one to one,
computable encoding of all pairs of natural numbers onto $N$. $\langle \cdot, \cdot, \cdot \rangle$ similarly
denotes a computable, 1-1 encoding of all triples of natural numbers onto $N$.
$\overline{L}$ denotes the complement of set $L$. $\chi_L$ denotes the characteristic function of
set $L$. $L_1 \Delta L_2$ denotes the symmetric difference of $L_1$ and $L_2$, i.e., $L_1 \Delta L_2 =
(L_1 - L_2) \cup (L_2 - L_1)$. $L_1 =^a L_2$ means that $\mathrm{card}(L_1 \Delta L_2) \leq a$. Quantifiers
$\forall^\infty$, $\exists^\infty$, and $\exists!$ denote for all but finitely many, there exist infinitely many, and
there exists a unique respectively.
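
The text fixes an arbitrary computable bijective pairing and does not say which one. As a minimal sketch, assuming the classical Cantor pairing as one possible choice (the names `pair`, `unpair` and `triple` are illustrative, not the paper's), such an encoding of pairs and triples onto the natural numbers can be written as follows.

```python
import math
from typing import Tuple


def pair(x: int, y: int) -> int:
    """Cantor pairing: a computable bijection from N x N onto N."""
    return (x + y) * (x + y + 1) // 2 + y


def unpair(z: int) -> Tuple[int, int]:
    """Inverse of pair; recovers (x, y) from its code z."""
    w = (math.isqrt(8 * z + 1) - 1) // 2  # index of the diagonal containing z
    y = z - w * (w + 1) // 2
    return w - y, y


def triple(x: int, y: int, z: int) -> int:
    """One way to encode triples: nest the pairing function."""
    return pair(x, pair(y, z))
```

Any other computable bijection would serve equally well; the results of the paper depend only on the encoding being fixed, one to one, onto and computable.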

$\mathcal{R}$ denotes the set of total computable functions from $N$ to $N$. $f, g$ range over
total computable functions. $\mathcal{E}$ denotes the set of all recursively enumerable sets.
$L$ ranges over $\mathcal{E}$. $\mathcal{L}$ ranges over subsets of $\mathcal{E}$. $\varphi$ denotes a standard acceptable
programming system (acceptable numbering). $\varphi_i$ denotes the function computed
by the $i$-th program in the programming system $\varphi$. We also call $i$ a program or
index for $\varphi_i$. For a (partial) function $\eta$, domain($\eta$) and range($\eta$) respectively
denote the domain and range of partial function $\eta$. We often write $\eta(x)\downarrow$ ($\eta(x)\uparrow$)
to denote that $\eta(x)$ is defined (undefined). $W_i$ denotes the domain of $\varphi_i$. $W_i$ is
considered as the language enumerated by the $i$-th program in the $\varphi$ system, and we
say that $i$ is a grammar or index for $W_i$. $\Phi$ denotes a standard Blum complexity
measure [Blu67] for the programming system $\varphi$. $W_{i,s} = \{x < s \mid \Phi_i(x) < s\}$.

⁸ Decorations are subscripts, superscripts, primes and the like.

A text is a mapping from $N$ to $N \cup \{\#\}$. We let $T$ range over texts. content($T$)
is defined to be the set of natural numbers in the range of $T$ (i.e. content($T$) =
range($T$) $- \{\#\}$). $T$ is a text for $L$ iff content($T$) = $L$. That means a text for $L$
is an infinite sequence whose range, except for a possible $\#$, is just $L$.

An information sequence or informant is a mapping from $N$ to $(N \times \{0, 1\}) \cup
\{\#\}$. We let $I$ range over informants. content($I$) is defined to be the set of pairs
in the range of $I$ (i.e. content($I$) = range($I$) $- \{\#\}$). An informant for $L$ is
an informant $I$ such that content($I$) = $\{(x, b) \mid \chi_L(x) = b\}$. It is useful to
consider the canonical information sequence for $L$. $I$ is a canonical information
sequence for $L$ iff $I(x) = (x, \chi_L(x))$. We sometimes abuse notation and refer to
the canonical information sequence for $L$ by $\chi_L$.

$\sigma$ and $\tau$ range over finite initial segments of texts or information sequences,
where the context determines which is meant. We denote the set of finite initial
segments of texts by SEG and the set of finite initial segments of information
sequences by SEQ. We use $\sigma \subseteq T$ (respectively, $\sigma \subseteq I$, $\sigma \subseteq \tau$) to denote that $\sigma$ is an
initial segment of $T$ (respectively, $I$, $\tau$). $|\sigma|$ denotes the length of $\sigma$. $T[n]$ denotes
the initial segment of $T$ of length $n$. Similarly, $I[n]$ denotes the initial segment
of $I$ of length $n$. Let $T[m : n]$ denote the segment $T(m), T(m + 1), \ldots, T(n - 1)$
(i.e. $T[n]$ with the first $m$ elements, $T[m]$, removed). $I[m : n]$ is defined similarly.
$\sigma \diamond \tau$ (respectively, $\sigma \diamond T$, $\sigma \diamond I$) denotes the concatenation of $\sigma$ and $\tau$
(respectively, concatenation of $\sigma$ and $T$, concatenation of $\sigma$ and $I$). We sometimes abuse
notation and say $\sigma \diamond w$ to denote the concatenation of $\sigma$ with the sequence of
one element $w$.

A learning machine $M$ is a mapping from initial segments of texts (information
sequences) to $N$. We say that $M$ converges on $T$ to $i$ (written: $M(T)\downarrow = i$)
iff, for all but finitely many $n$, $M(T[n]) = i$. If there is no $i$ such that $M(T)\downarrow = i$,
then we say that $M$ diverges on $T$ (written: $M(T)\uparrow$). Convergence on information
sequences is defined similarly.

Let ProgSet($M$, $\sigma$) = $\{M(\tau) \mid \tau \subseteq \sigma\}$.



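Definition 1. Suppose $a, b \in N \cup \{*\}$.

(a) Below, for each of several learning criteria $\mathbf{J}$, we define what it means for a
machine $M$ to $\mathbf{J}$-identify a language $L$ from a text $T$ or informant $I$.

• $M$ $\mathbf{TxtEx}^a$-identifies $L$ from text $T$ iff $(\exists i \mid W_i =^a L)[M(T)\downarrow = i]$.

• [Gol67, CL82] $M$ $\mathbf{InfEx}^a$-identifies $L$ from informant $I$ iff $(\exists i \mid W_i =^a L)[M(I)\downarrow = i]$.

• $M$ $\mathbf{TxtBc}^a$-identifies $L$ from text $T$ iff $(\forall^\infty n)[W_{M(T[n])} =^a L]$.

• [Bār74, CL82] $M$ $\mathbf{InfBc}^a$-identifies $L$ from informant $I$ iff $(\forall^\infty n)[W_{M(I[n])} =^a L]$.

(b) Suppose $\mathbf{J} \in \{\mathbf{TxtEx}^a, \mathbf{TxtBc}^a\}$. $M$ $\mathbf{J}$-identifies $L$ iff, for all texts $T$ for $L$,
$M$ $\mathbf{J}$-identifies $L$ from $T$. In this case we also write $L \in \mathbf{J}(M)$.

We say that $M$ $\mathbf{J}$-identifies $\mathcal{L}$ iff $M$ $\mathbf{J}$-identifies each $L \in \mathcal{L}$.

$\mathbf{J} = \{\mathcal{L} \mid (\exists M)[\mathcal{L} \subseteq \mathbf{J}(M)]\}$.

(c) Suppose $\mathbf{J} \in \{\mathbf{InfEx}^a, \mathbf{InfBc}^a\}$. $M$ $\mathbf{J}$-identifies $L$ iff, for all information
sequences $I$ for $L$, $M$ $\mathbf{J}$-identifies $L$ from $I$. In this case we also write $L \in \mathbf{J}(M)$.
We say that $M$ $\mathbf{J}$-identifies $\mathcal{L}$ iff $M$ $\mathbf{J}$-identifies each $L \in \mathcal{L}$.

$\mathbf{J} = \{\mathcal{L} \mid (\exists M)[\mathcal{L} \subseteq \mathbf{J}(M)]\}$.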

We often write $\mathbf{TxtEx}^0$ as $\mathbf{TxtEx}$. A similar convention applies to the other
learning criteria of this paper.

Next we prepare to introduce our noisy inference criteria, and, in that
interest, we define some ways to calculate the number of occurrences of words in
(initial segments of) a text or informant. For $\sigma \in$ SEG, and text $T$, let

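$\mathrm{occur}(\sigma, w) \stackrel{\mathrm{def}}{=} \mathrm{card}(\{j \mid j < |\sigma| \wedge \sigma(j) = w\})$ and

$\mathrm{occur}(T, w) \stackrel{\mathrm{def}}{=} \mathrm{card}(\{j \mid j \in N \wedge T(j) = w\})$.

For $\sigma \in$ SEQ and information sequence $I$, $\mathrm{occur}(\cdot, \cdot)$ is defined similarly
except that $w$ is replaced by $(v, b)$.

For any language $L$, $\mathrm{occur}(T, L) \stackrel{\mathrm{def}}{=} \sum_{x \in L} \mathrm{occur}(T, x)$. It is useful to introduce
the set of positive and negative occurrences in (initial segment of) an informant.
Suppose $\sigma \in$ SEQ.

$\mathrm{PosInfo}(\sigma) \stackrel{\mathrm{def}}{=} \{v \mid \mathrm{occur}(\sigma, (v, 1)) \geq \mathrm{occur}(\sigma, (v, 0)) \wedge \mathrm{occur}(\sigma, (v, 1)) \geq 1\}$

$\mathrm{NegInfo}(\sigma) \stackrel{\mathrm{def}}{=} \{v \mid \mathrm{occur}(\sigma, (v, 1)) < \mathrm{occur}(\sigma, (v, 0)) \wedge \mathrm{occur}(\sigma, (v, 0)) \geq 1\}$

That means that $\mathrm{PosInfo}(\sigma) \cup \mathrm{NegInfo}(\sigma)$ is just the set of all $v$ such that either
$(v, 0)$ or $(v, 1)$ occurs in $\sigma$. Then $v \in \mathrm{PosInfo}(\sigma)$ if $(v, 1)$ occurs at least as often
as $(v, 0)$ and $v \in \mathrm{NegInfo}(\sigma)$ otherwise. Similarly,

$\mathrm{PosInfo}(I) = \{v \mid \mathrm{occur}(I, (v, 1)) \geq \mathrm{occur}(I, (v, 0)) \wedge \mathrm{occur}(I, (v, 1)) \geq 1\}$

$\mathrm{NegInfo}(I) = \{v \mid \mathrm{occur}(I, (v, 1)) < \mathrm{occur}(I, (v, 0)) \wedge \mathrm{occur}(I, (v, 0)) \geq 1\}$

where, if $\mathrm{occur}(I, (v, 0)) = \mathrm{occur}(I, (v, 1)) = \infty$, then we place $v$ in $\mathrm{PosInfo}(I)$
(this is just to make the definition precise; we will not need this for criteria of
inference discussed in this paper).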
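
For finite initial segments these counting definitions are directly computable. The following minimal sketch assumes an informant segment is a Python list of pairs (v, b), with `None` standing in for the pause symbol #; the function names are illustrative and not part of the paper.

```python
from collections import Counter
from typing import Optional, Sequence, Set, Tuple

Item = Optional[Tuple[int, int]]  # None plays the role of the pause symbol '#'


def occur(segment: Sequence[Item], item: Tuple[int, int]) -> int:
    """Number of positions of the segment holding exactly this item."""
    return sum(1 for entry in segment if entry == item)


def pos_info(segment: Sequence[Item]) -> Set[int]:
    """Values v for which (v, 1) occurs at least once and at least as often as (v, 0)."""
    counts = Counter(entry for entry in segment if entry is not None)
    values = {v for (v, _bit) in counts}
    return {v for v in values
            if counts[(v, 1)] >= counts[(v, 0)] and counts[(v, 1)] >= 1}


def neg_info(segment: Sequence[Item]) -> Set[int]:
    """Values v for which (v, 0) occurs strictly more often than (v, 1)."""
    counts = Counter(entry for entry in segment if entry is not None)
    values = {v for (v, _bit) in counts}
    return {v for v in values
            if counts[(v, 1)] < counts[(v, 0)] and counts[(v, 0)] >= 1}


# Hypothetical segment: 3 is reported positive twice and negative once,
# 5 only negative, and 7 once each way (ties go to the positive side).
SIGMA = [(3, 1), (3, 0), (3, 1), (5, 0), None, (7, 1), (7, 0)]
```

On `SIGMA`, the positive set is `{3, 7}` and the negative set is `{5}`; the two sets are disjoint and together contain every value mentioned in the segment.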

Definition 2 [Ste95]. An information sequence $I$ is a noisy information
sequence (or noisy informant) for $L$ iff $(\forall x)[\mathrm{occur}(I, (x, \chi_L(x))) = \infty \wedge
\mathrm{occur}(I, (x, \chi_{\overline{L}}(x))) < \infty]$. A text $T$ is a noisy text for $L$ iff $(\forall x \in
L)[\mathrm{occur}(T, x) = \infty]$ and $\mathrm{occur}(T, \overline{L}) < \infty$.

On the one hand, both concepts are similar since $L = \{x \mid \mathrm{occur}(I, (x, 1)) =
\infty\} = \{x \mid \mathrm{occur}(T, x) = \infty\}$. On the other hand, the concepts differ in the
way they treat errors. In the case of informant every false item $(x, \chi_{\overline{L}}(x))$ may
occur a finite number of times. In the case of text, it is mathematically more
interesting to require, as we do, that the total amount of false information has
to be finite.⁹

Definition 3 [Ste95]. Suppose $a \in N \cup \{*\}$. Suppose $\mathbf{J} \in
\{\mathbf{TxtEx}^a, \mathbf{TxtBc}^a\}$. Then $M$ $\mathbf{NoisyJ}$-identifies $L$ iff, for all noisy texts $T$ for
$L$, $M$ $\mathbf{J}$-identifies $L$ from $T$. In this case we write $L \in \mathbf{NoisyJ}(M)$.

$M$ $\mathbf{NoisyJ}$-identifies a class $\mathcal{L}$ iff $M$ $\mathbf{NoisyJ}$-identifies each $L \in \mathcal{L}$.

$\mathbf{NoisyJ} = \{\mathcal{L} \mid (\exists M)[\mathcal{L} \subseteq \mathbf{NoisyJ}(M)]\}$.

⁹ As we noted in Section 1 above, the alternative of allowing each false item in a text
to occur finitely often is too restrictive; it would, then, be impossible to learn even
the class of all singleton sets [Ste95].

Inference criteria for learning from noisy informants are defined similarly.

Note that in all the learning criteria formally defined thus far in this section,
the (possibly noisy) texts or informants may be of arbitrary complexity. In a
completely computable universe all texts and informants (even noisy ones) must
be recursive (synonym: computable). As noted in Section 1 above, this motivates
our concentrating in this paper on recursive texts and informants.

When a learning criterion is restricted to requiring learning from recursive
texts/informants only, then we name the resultant criterion by adding, in an
appropriate spot, ‘Rec’ to the name of the unrestricted criterion. For example,
RecTxtEx-identification is this restricted variant of TxtEx-identification.
Formally, RecTxtEx-identification may be defined as follows.

Definition 4. $M$ $\mathbf{RecTxtEx}^a$-identifies $L$ iff, for all recursive texts $T$ for $L$,
$M$ $\mathbf{TxtEx}^a$-identifies $L$ from $T$.

One can similarly define $\mathbf{RecInfEx}^a$, $\mathbf{RecTxtBc}^a$, $\mathbf{RecInfBc}^a$,
$\mathbf{NoisyRecTxtEx}^a$, $\mathbf{NoisyRecTxtBc}^a$, $\mathbf{NoisyRecInfEx}^a$, and
$\mathbf{NoisyRecInfBc}^a$.

$\mathbf{RecTxtBc}^a \neq \mathbf{TxtBc}^a$ [CL82, Fre85]; however, $\mathbf{TxtEx}^a = \mathbf{RecTxtEx}^a$
lfifi7,»;iWie77i.^^ . lO.ISfl^hl showed that, for a G N U {*}, NoisyInfBc“ U 
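$\mathbf{NoisyTxtBc}^a \subseteq \mathbf{TxtBc}^a$ and $\mathbf{NoisyInfEx}^a \cup \mathbf{NoisyTxtEx}^a \subseteq \mathbf{TxtEx}^a$.
The proof of the above also shows: $\mathbf{NoisyRecInfBc}^a \cup \mathbf{NoisyRecTxtBc}^a \subseteq
\mathbf{RecTxtBc}^a$ and $\mathbf{NoisyRecInfEx}^a \cup \mathbf{NoisyRecTxtEx}^a \subseteq \mathbf{RecTxtEx}^a$. In
Section 3 below, we indicate the remaining comparisons.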

2.2 Recursively Enumerable Classes and Indexed Families 

This paper is about the synthesis of algorithmic learners for r.e. classes of r.e.
languages and of indexed families of recursive languages. To this end we define, for
all $i$, $\mathcal{C}_i \stackrel{\mathrm{def}}{=} \{W_j \mid j \in W_i\}$. Hence, $\mathcal{C}_i$ is the r.e. class with index $i$. For a decision
procedure $j$, we let $U_j \stackrel{\mathrm{def}}{=} \{x \mid \varphi_j(x) = 1\}$. For a decision procedure $j$, we let
$U_j[n]$ denote $\{x \in U_j \mid x < n\}$. For all $i$,

$$\mathcal{U}_i \stackrel{\mathrm{def}}{=} \begin{cases} \{U_j \mid j \in W_i\}, & \text{if } (\forall j \in W_i)[j \text{ is a decision procedure}]; \\ \emptyset, & \text{otherwise.} \end{cases}$$

Hence, $\mathcal{U}_i$ is the indexed family with index $i$.

3 Comparisons 

In this section we consider the comparisons between the inference criteria in- 
troduced in this paper among themselves and with the related inference criteria 
from the literature. We omit the proofs of these theorems due to lack of space. 

The next theorem says that for Bc*-learning, with computable noise, from
either texts or informants, some machine learns grammars for all the r.e.
languages. It improves a similar result from [CL82] for the noise-free case.

Theorem 1. (a) $\mathcal{E} \in \mathbf{NoisyRecTxtBc}^*$.

(b) $\mathcal{E} \in \mathbf{NoisyRecInfBc}^*$.

The next result says that for Ex-style learning with noisy texts or informants, 
restricting the data sequences to be computable does not help us. 

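Theorem 2. Suppose $a \in N \cup \{*\}$.

(a) $\mathbf{NoisyTxtEx}^a = \mathbf{NoisyRecTxtEx}^a$.

(b) $\mathbf{NoisyInfEx}^a = \mathbf{NoisyRecInfEx}^a$.

Theorem 3. Suppose $n \in N$.

(a) $\mathbf{NoisyTxtEx} - \mathbf{NoisyRecInfBc}^n \neq \emptyset$.

(b) $\mathbf{NoisyInfEx} - \mathbf{NoisyRecTxtBc}^n \neq \emptyset$.

Theorem 4. Suppose $n \in N$.

(a) $\mathbf{NoisyTxtBc}^{n+1} - \mathbf{RecInfBc}^n \neq \emptyset$.

(b) $\mathbf{NoisyInfBc}^{n+1} - \mathbf{RecInfBc}^n \neq \emptyset$.

Theorem 5. (a) $\mathbf{NoisyRecTxtBc} - \mathbf{TxtBc}^* \neq \emptyset$.

(b) $\mathbf{NoisyRecInfBc} - \mathbf{TxtBc}^* \neq \emptyset$.

It is open at present whether, for $m < n$, (i) $\mathbf{NoisyRecTxtBc}^m - \mathbf{InfBc}^n \neq
\emptyset$, and whether (ii) $\mathbf{NoisyRecInfBc}^m - \mathbf{InfBc}^n \neq \emptyset$. In this context note that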



Theorem 6. RecTxtBc” fl 2^ C InfBc”. 

4 Principal Results on Synthesizers 

Since $\mathcal{E} \in \mathbf{NoisyRecTxtBc}^*$ and $\mathcal{E} \in \mathbf{NoisyRecInfBc}^*$, the only cases of
interest are regarding when $\mathbf{NoisyRecTxtBc}^n$ and $\mathbf{NoisyRecInfBc}^n$ synthesizers
can be obtained algorithmically.



4.1 Principal Results on Synthesizing from Uniform Decision 
Indices 

The next result is the main theorem of the present paper. 

Theorem 7. $(\exists f \in \mathcal{R})(\forall i)[\mathcal{U}_i \subseteq \mathbf{NoisyRecTxtBc}(M_{f(i)})]$.

Proof. Let $M_{f(i)}$ be such that $M_{f(i)}(T[n]) = prog(T[n])$, where $W_{prog(T[n])}$ is
defined below. Construction of $prog$ will easily be seen to be algorithmic in $i$.

If $\mathcal{U}_i$ is empty, then trivially $M_{f(i)}$ $\mathbf{NoisyRecTxtBc}$-identifies $\mathcal{U}_i$. So
suppose $\mathcal{U}_i$ is nonempty (in particular, for all $j \in W_i$, $\varphi_j$ is a decision procedure).
In the construction below, we will thus assume without loss of generality that,
for each $j \in W_i$, $\varphi_j$ is a decision procedure.

Let $g$ be a computable function such that range($g$) = $\{\langle j, k \rangle \mid j \in W_i \wedge k \in
N\}$. Intuitively, for an input noisy recursive text $T$ for a language $L$, think of $m$
such that $g(m) = \langle j, k \rangle$ as representing the hypothesis: (i) $L = U_j$, (ii) $\varphi_k = T$,
and (iii) $T[m : \infty]$ does not contain any element from $\overline{L}$. In the procedure below,
we just try to collect “non-harmful” and “good” hypotheses in $P_n$ and $Q_n^s$ (more
details on this in the analysis of $prog(T[n])$ below). Let P1 and P2 be recursive
functions such that $g(m) = \langle P1(m), P2(m) \rangle$.

$W_{\mathrm{prog}(T[n])}$

1. Let $P_n = \{m \mid m < n\} - [\{m \mid \mathrm{content}(T[m : n]) \not\subseteq U_{P1(m)}\} \cup \{m \mid (\exists k < n)[\Phi_{P2(m)}(k) < n \wedge \varphi_{P2(m)}(k) \neq T(k)]\}]$.
(* Intuitively, $P_n$ is obtained by deleting $m < n$ which represent a clearly wrong hypothesis. *)

(* $Q_n^s$ below is obtained by refining $P_n$ so that some further properties are satisfied. *)

2. Let $Q_n^0 = P_n$.

Go to stage 0.

3. Stage $s$

3.1 Enumerate $\bigcap_{m \in Q_n^s} U_{P1(m)}$.

3.2 Let $Q_n^{s+1} = Q_n^s - \{m' \mid (\exists m'' \in Q_n^s)(\exists k < s)[m'' < m' < k \wedge [\Phi_{P2(m'')}(k) < s \wedge \varphi_{P2(m'')}(k) \notin U_{P1(m')}]]\}$.

3.3 Go to stage $s + 1$.

End stage $s$.

End
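As a rough illustration of step 1 of the construction, the following Python sketch computes $P_n$ in a toy setting. Everything concrete here is an assumption made for illustration: decision procedures are modelled as Python predicates `U[j]`, programs for texts as pairs `(value, steps)` where `steps(k)` stands in for the step count $\Phi$ (with `None` for divergence), `#` marks a pause in the text, and the names `compute_P`, `content`, `g` and `programs` are hypothetical. The sketch does not implement the stages that refine $P_n$ into $Q_n^s$.

```python
PAUSE = "#"  # assumed marker for positions of a text that carry no datum


def content(segment):
    """Set of data items occurring in a text segment (pauses ignored)."""
    return {x for x in segment if x != PAUSE}


def compute_P(n, T, g, U, programs):
    """Return P_n: those m < n whose hypothesis g(m) = (j, k) is not clearly wrong.

    T        -- the input text, as a function from positions to items
    g        -- maps m to a pair (j, k)
    U[j]     -- predicate deciding membership in the language U_j
    programs -- programs[k] = (value, steps); steps(pos) is None on divergence
    """
    prefix = [T(pos) for pos in range(n)]
    P = []
    for m in range(n):
        j, k = g(m)
        # First deleted set: T[m:n] exhibits an element outside U_j.
        if any(not U[j](x) for x in content(prefix[m:n])):
            continue
        # Second deleted set: program k halts within n steps at some
        # position pos < n and disagrees with T there.
        value, steps = programs[k]
        refuted = False
        for pos in range(n):
            cost = steps(pos)
            if cost is not None and cost < n and value(pos) != T(pos):
                refuted = True
                break
        if not refuted:
            P.append(m)
    return P


# Toy instance (all assumed): U[0] = even numbers, U[1] = multiples of 4.
U = [lambda x: x % 2 == 0, lambda x: x % 4 == 0]
T = lambda pos: 3 if pos == 0 else 2 * pos      # text for the evens, one noisy item
programs = [(T, lambda pos: pos + 1),            # a program that really computes T
            (lambda pos: 0, lambda pos: 1)]      # a program for a different text
g = lambda m: (m % 2, (m // 2) % 2)

print(compute_P(6, T, g, U, programs))
```

In this toy run only a hypothesis that names the right language, names a program for the actual text, and starts after the noisy position survives the filter.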

Let $T$ be a noisy recursive text for $L \in \mathcal{U}_i$. Let $m$ be such that $U_{P1(m)} = L$, $T[m : \infty]$ is a text for $L$, and $\varphi_{P2(m)} = T$. Note that there exists such an $m$ (since $\varphi$ is an acceptable numbering, and $T$ is a noisy recursive text for $L$). Consider the definition of $W_{\mathrm{prog}(T[n])}$ for $n \in N$ as above.

Claim. For all $m' < m$, for all but finitely many $n$, if $m' \in P_n$ then

(a) $L \subseteq U_{P1(m')}$, and

(b) $(\forall k)[\varphi_{P2(m')}(k)\!\uparrow \;\vee\; \varphi_{P2(m')}(k) = T(k)]$.

Proof. Suppose $m' < m$.

(a) If $U_{P1(m')} \not\supseteq L$, then there exists a $k > m'$ such that $T(k) \notin U_{P1(m')}$. Thus, for $n > k$, $m' \notin P_n$.

(b) If there exists a $k$ such that $[\varphi_{P2(m')}(k)\!\downarrow \neq T(k)]$, then for all $n > \max(\{k, \Phi_{P2(m')}(k)\})$, $m' \notin P_n$.

The claim follows. □

Claim. For all but finitely many $n$: $m \in P_n$.

Proof. For $n > m$, clearly $m \in P_n$. □

Let $n_0$ be such that, for all $n \geq n_0$, (a) $m \in P_n$, and (b) for all $m' < m$, if $m' \in P_n$, then $L \subseteq U_{P1(m')}$ and $(\forall k)[\varphi_{P2(m')}(k)\!\uparrow \;\vee\; \varphi_{P2(m')}(k) = T(k)]$. (There exists such an $n_0$ by the two preceding claims.)

Claim. Consider any $n \geq n_0$. Then, for all $s$, we have $m \in Q_n^s$. It follows that $W_{\mathrm{prog}(T[n])} \subseteq L$.

Proof. Fix $n \geq n_0$. The only way $m$ can be missing from $Q_n^s$ is the existence of $m'' < m$ and $t > m$ such that $m'' \in P_n$ and $\varphi_{P2(m'')}(t)\!\downarrow \notin L$. But then $m'' \notin P_n$ by the condition on $n_0$. Thus $m \in Q_n^s$, for all $s$. □

Claim. Consider any $n \geq n_0$. Suppose $m < m' < n$. If $(\exists^{\infty} s)[m' \in Q_n^s]$, then $L \subseteq U_{P1(m')}$. Note that, using the condition on $n_0$, this claim implies $L \subseteq W_{\mathrm{prog}(T[n])}$.

Proof. Fix any $n \geq n_0$. Suppose $(\exists^{\infty} s)[m' \in Q_n^s]$. Thus, $(\forall s)[m' \in Q_n^s]$. Suppose $L \not\subseteq U_{P1(m')}$. Let $y \in L - U_{P1(m')}$. Let $k > m'$ be such that $T(k) = y$. Note that there exists such a $k$, since $y$ appears infinitely often in $T$. But then $\varphi_{P2(m)}(k)\!\downarrow \notin U_{P1(m')}$. This would imply that $m' \notin Q_n^s$, for some $s$, by step 3.2 in the construction. Thus, $L \subseteq U_{P1(m')}$, and the claim follows. □

From the two preceding claims it follows that, for $n \geq n_0$, $W_{\mathrm{prog}(T[n])} = L$. Thus, $M_{f(i)}$ NoisyRecTxtBc-identifies $\mathcal{U}_i$. ∎



Corollary 1 Every indexed family belongs to NoisyRecTxtBc.

As noted above, then, the class of finite unions of pattern languages is NoisyRecTxtBc-learnable!

Remark 1. In the above theorem, learnability is not obtained by learning the 
rule for generating the noise. In fact, in general, it is impossible to learn (in 
the Bc-sense) the rule for noisy text generation (even though the noisy text is 
computable)! 

While the $\mathbf{NoisyRecTxtBc}^{a}$-hierarchy collapses for indexed families, we see below that the $\mathbf{NoisyRecInfBc}^{a}$-hierarchy does not so collapse.

Definition 5 Inf [S', L] {r [ (Vx G S) [occur(r, = 0]}. 



Lemma 1. Let $n \in N$.

(a) Suppose $L$ is a recursive language, and $M$ $\mathbf{NoisyRecInfBc}^{n}$-identifies $L$. Then there exist $\sigma$ and $z$ such that $(\forall \tau \in \mathrm{Inf}[\{x \mid x \leq z\}, L])[\mathrm{card}(W_{M(\sigma \diamond \tau)} \,\triangle\, L) \leq n]$.

(b) Suppose $\mathcal{L}$ is an indexed family in $\mathbf{NoisyRecInfBc}^{n}$. Then, for all $L \in \mathcal{L}$, there exists a $z$ such that, for all $L' \in \mathcal{L}$, $[(\{x \leq z \mid x \in L\} = \{x \leq z \mid x \in L'\}) \Rightarrow (\mathrm{card}(L' - L) \leq 2n)]$.

We omit the proof due to lack of space. An easy application of the above lemma yields the following theorem.

Theorem 8. Suppose $n \in N$. $\{L \mid \mathrm{card}(L) \leq 2(n+1)\} \in \mathbf{NoisyInfBc}^{n+1} - \mathbf{NoisyRecInfBc}^{n}$.

We will see in Corollary 2 below that it is possible to algorithmically synthesize learners for NoisyRecInfBc-learnable indexed families.

Theorem 9. There exists $f \in \mathcal{R}$ such that the following is satisfied. Suppose $(\forall L \in \mathcal{U}_i)(\exists z)(\forall L' \in \mathcal{U}_i)[(\{x \leq z \mid x \in L\} = \{x \leq z \mid x \in L'\}) \Rightarrow L' \subseteq L]$. Then, $[\mathcal{U}_i \subseteq \mathbf{NoisyRecInfBc}(M_{f(i)})]$.

As a corollary to Lemma 1(b) and Theorem 9 we have the second main, positive result of the present paper:

Corollary 2 $(\exists f \in \mathcal{R})(\forall i \mid \mathcal{U}_i \in \mathbf{NoisyRecInfBc})[\mathcal{U}_i \subseteq \mathbf{NoisyRecInfBc}(M_{f(i)})]$.

The following corollary to Lemma 1(b) and Theorem 9 provides a very nice characterization of indexed families in NoisyRecInfBc.³

Corollary 3 $\mathcal{U}_i \in \mathbf{NoisyRecInfBc}$ $\Leftrightarrow$ for all $L \in \mathcal{U}_i$, there exists a $z$ such that, for all $L' \in \mathcal{U}_i$, $[(\{x \leq z \mid x \in L\} = \{x \leq z \mid x \in L'\}) \Rightarrow L' \subseteq L]$.
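The condition in Corollary 3 can be checked mechanically for small finite families, which may help build intuition. The sketch below is only a toy: languages are finite Python sets of naturals, the search for $z$ is cut off at an assumed `bound`, and the function name `satisfies_condition` is hypothetical. For infinite families such a bounded search proves nothing in general.

```python
def agree_up_to(L1, L2, z):
    """True iff L1 and L2 contain exactly the same elements x <= z."""
    return {x for x in L1 if x <= z} == {x for x in L2 if x <= z}


def satisfies_condition(family, bound):
    """Bounded check of the condition in Corollary 3.

    For every L in the family, look for some z <= bound such that every L'
    in the family agreeing with L on {0, ..., z} is a subset of L.
    """
    return all(
        any(
            all(Lp <= L for Lp in family if agree_up_to(L, Lp, z))
            for z in range(bound + 1)
        )
        for L in family
    )


chain = [{0}, {0, 1}, {0, 1, 2}]
spread = [{0}] + [{0, n} for n in range(1, 6)]

print(satisfies_condition(chain, 3))   # a witness z exists for each member
print(satisfies_condition(spread, 3))  # {0, 4} agrees with {0} up to 3 but is no subset
```

The second family hints at why the condition can fail for infinite families: if the extra element can be pushed beyond every candidate $z$, no witness exists for the language `{0}`.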

For $n > 0$, we do not know about synthesizing learners for $\mathcal{U}_i \in \mathbf{NoisyRecInfBc}^{n}$.

4.2 Principal Results on Synthesizing from R.E. Indices 

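Theorem 10. $\neg(\exists f \in \mathcal{R})(\forall i \mid \mathcal{C}_i \in \mathbf{NoisyTxtEx} \cap \mathbf{NoisyInfEx})[\mathcal{C}_i \subseteq \mathbf{RecTxtBc}^{n}(M_{f(i)})]$.

Proof. Theorem 17 in [CJS98a] showed $\neg(\exists f \in \mathcal{R})(\forall i \mid \mathcal{C}_i \in \mathbf{NoisyTxtEx} \cap \mathbf{NoisyInfEx})[\mathcal{C}_i \subseteq \mathbf{TxtBc}^{n}(M_{f(i)})]$. The proof of this given in [CJS98a] also shows that $\neg(\exists f \in \mathcal{R})(\forall i \mid \mathcal{C}_i \in \mathbf{NoisyTxtEx} \cap \mathbf{NoisyInfEx})[\mathcal{C}_i \subseteq \mathbf{RecTxtBc}^{n}(M_{f(i)})]$. ∎

Corollary 4 $\neg(\exists f \in \mathcal{R})(\forall i \mid \mathcal{C}_i \in \mathbf{NoisyTxtEx} \cap \mathbf{NoisyInfEx})[\mathcal{C}_i \subseteq \mathbf{NoisyRecTxtBc}^{n}(M_{f(i)})]$.

Corollary 5 $\neg(\exists f \in \mathcal{R})(\forall i \mid \mathcal{C}_i \in \mathbf{NoisyTxtEx} \cap \mathbf{NoisyInfEx})[\mathcal{C}_i \subseteq \mathbf{NoisyRecInfBc}^{n}(M_{f(i)})]$.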

5 Conclusions and Future Directions 

In a completely computable universe, all data sequences, even noisy ones, are computable. Based on this, we studied in this paper the effects of having computable noisy data as input. In addition to comparing the criteria so formed within themselves and with related criteria from the literature, we studied the problem of synthesizing learners for r.e. classes and indexed families of languages. The main result of the paper (Theorem 7) showed that all indexed families of languages can be learned (in Bc-sense) from computable noisy texts. Moreover, one can algorithmically find a learner doing so, from an index for any indexed family! Another main positive result of the paper, Corollary 3, gives a characterization of indexed families which can be learned (in Bc-sense) from computable noisy informant. In addition to the results presented in the paper, we have related results for synthesis in the case of computable data (texts and informants) when there is no noise in the input. Due to lack of space we omit these results.

³ Hence, as was noted above, we have: $\{L \mid \mathrm{card}(N - L) \text{ is finite}\} \in (\mathbf{NoisyRecInfBc} - \mathbf{NoisyInfBc})$.

It would be interesting to extend the study to the case where the texts have some restriction other than the computability restriction we considered in this paper. In this regard we have considered limiting recursive texts.⁴ One of the results we have here is that TxtBc = LimRecTxtBc and NoisyTxtBc = LimRecNoisyTxtBc (where the LimRec in LimRecTxtBc and LimRecNoisyTxtBc denotes that identification is supposed to be from limiting recursive texts, noise-free and noisy, respectively). One can also similarly consider texts from natural subrecursive classes [RC94], linear-time computable and above. From [Gol67, Cas86], in that setting, some machine learns $\mathcal{E}$. However, it remains to determine the possible tradeoffs between the complexity of the texts and useful complexity features of the resultant learners. [Cas86] mentions that, in some cases, subrecursiveness of texts forces infinite repetition of data. Can this be connected to complexity tradeoffs? [Cas86] further notes that, if the texts we present to children contain many repetitions, that would be consistent with a restriction in the world to subrecursive texts.



References 

Ang80. D. Angluin. Inductive inference of formal languages from positive data. Information and Control, 45:117-135, 1980.

AS94. H. Arimura and T. Shinohara. Inductive inference of Prolog programs with linear data dependency from positive data. In Proc. Information Modelling and Knowledge Bases V, pages 365-375. IOS Press, 1994.

ASY92. S. Arikawa, T. Shinohara, and A. Yamamoto. Learning elementary formal 
systems. Theoretical Computer Science, 95:97-113, 1992. 

B74. J. Barzdins. Two theorems on the limiting synthesis of functions. In Theory 
of Algorithms and Programs, vol. 1, pages 82-88. Latvian State University, 
1974. In Russian. 

BB75. L. Blum and M. Blum. Toward a mathematical theory of inductive inference. 
Information and Control, 28:125-155, 1975. 

BCJ96. G. Baliga, J. Case, and S. Jain. Synthesizing enumeration techniques for 
language learning. In Proceedings of the Ninth Annual Conference on Com- 
putational Learning Theory, pages 169-180. ACM Press, 1996. 

⁴ The limiting recursive texts are in between the computable and the arbitrarily uncomputable. Informally, they are the ones computed by limiting-programs, programs that “change their minds” finitely many times about each output before getting it right [Sha71, Soa87].






J. Case and S. Jain 



Ber85. R. Berwick. The Acquisition of Syntactic Knowledge. MIT Press, 1985. 

Blu67. M. Blum. A machine-independent theory of the complexity of recursive fun- 
ctions. Journal of the ACM, 14:322-336, 1967. 

Cas86. J. Case. Learning machines. In W. Demopoulos and A. Marras, editors. Lan- 
guage Learning and Concept Acquisition. Ablex Publishing Company, 1986. 

Cas96. J. Case. The power of vacillation in language learning. Technical Report LP-96-08, Logic, Philosophy and Linguistics Series of the Institute for Logic, Language and Computation, University of Amsterdam, 1996. To appear revised in SIAM Journal on Computing.

CJS98a. J. Case, S. Jain, and A. Sharma. Synthesizing noise-tolerant language lear- 
ners. Theoretical Computer Science A, 1998. Accepted. 

CJS98b. J. Case, S. Jain, and F. Stephan. Vacillatory and BC learning on noisy data. 
Theoretical Computer Science A, 1998. Accepted. 

CL82. J. Case and C. Lynes. Machine inductive inference and language identification. In M. Nielsen and E. M. Schmidt, editors, Proceedings of the 9th International Colloquium on Automata, Languages and Programming, volume 140 of Lecture Notes in Computer Science, pages 107-115. Springer-Verlag, 1982.

CS83. J. Case and C. Smith. Comparison of identification criteria for machine 
inductive inference. Theoretical Computer Science, 25:193-220, 1983. 

dJK96. D. de Jongh and M. Kanazawa. Angluin’s theorem for indexed families of r.e. sets and applications. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 193-204. ACM Press, July 1996.

Fre85. R. Freivalds. Recursiveness of the enumerating functions increases the infer- 
rability of recursively enumerable sets. Bulletin of the European Association 
for Theoretical Computer Science, 27:35-40, 1985. 

Ful90. M. Fulk. Robust separations in inductive inference. In 31st Annual IEEE Symposium on Foundations of Computer Science, pages 405-410, 1990.

Gol67. E. M. Gold. Language identification in the limit. Information and Control, 
10:447-474, 1967. 

Jan79a. K. Jantke. Automatic synthesis of programs and inductive inference of fun- 
ctions. In Int. Conf. Fundamentals of Computations Theory, pages 219-225, 
1979. 

Jan79b. K. Jantke. Natural properties of strategies identifying recursive functions. Elektronische Informationsverarbeitung und Kybernetik, 15:487-496, 1979.

Kap91. S. Kapur. Computational Learning of Languages. PhD thesis, Cornell Uni- 
versity, 1991. 

KB92. S. Kapur and G. Bilardi. Language learning without overgeneralization. In Proceedings of the Ninth Annual Symposium on Theoretical Aspects of Computer Science, volume 577 of Lecture Notes in Computer Science. Springer-Verlag, 1992.

LD94. N. Lavrač and S. Džeroski. Inductive Logic Programming. Ellis Horwood, New York, 1994.

LZK96. S. Lange, T. Zeugmann, and S. Kapur. Monotonic and dual monotonic language learning. Theoretical Computer Science A, 155:365-410, 1996.

MR94. S. Muggleton and L. De Raedt. Inductive logic programming: Theory and 
methods. Journal of Logic Programming, 19/20:669-679, 1994. 

Muk92. Y. Mukouchi. Characterization of finite identification. In K. Jantke, editor, Analogical and Inductive Inference, Proceedings of the Third International Workshop, pages 260-267, 1992.

Nix83. R. Nix. Editing by examples. Technical Report 280, Department of Computer 
Science, Yale University, New Haven, CT, USA, 1983. 




Synthesizing Learners Tolerating Computable Noisy Data



Odi89. P. Odifreddi. Classical Recursion Theory. North-Holland, Amsterdam, 1989.

OSW88. D. Osherson, M. Stob, and S. Weinstein. Synthesising inductive expertise. Information and Computation, 77:138-161, 1988.

RC94. J. Royer and J. Case. Subrecursive programming systems: Complexity & succinctness. Birkhäuser, 1994.

SA95. T. Shinohara and A. Arikawa. Pattern inference. In Klaus P. Jantke and Steffen Lange, editors, Algorithmic Learning for Knowledge-Based Systems, volume 961 of Lecture Notes in Artificial Intelligence, pages 259-291. Springer-Verlag, 1995.

Sha71. N. Shapiro. Review of “Limiting recursion” by E.M. Gold and “Trial and error predicates and the solution to a problem of Mostowski” by H. Putnam. Journal of Symbolic Logic, 36:342, 1971.

Shi83. T. Shinohara. Inferring unions of two pattern languages. Bulletin of Informatics and Cybernetics, 20:83-88, 1983.

Shi91. T. Shinohara. Inductive inference of monotonic formal systems from positive data. New Generation Computing, 8:371-384, 1991.

Smu61. R. Smullyan. Theory of Formal Systems, Annals of Mathematical Studies, No. 47. Princeton, NJ, 1961.

Soa87. R. Soare. Recursively Enumerable Sets and Degrees. Springer-Verlag, 1987.

SSS+94. S. Shimozono, A. Shinohara, T. Shinohara, S. Miyano, S. Kuhara, and S. Arikawa. Knowledge acquisition from amino acid sequences by machine learning system BONSAI. Trans. Information Processing Society of Japan, 35:2009-2018, 1994.

Ste95. F. Stephan. Noisy inference and oracles. In Algorithmic Learning Theory: Sixth International Workshop (ALT ’95), volume 997 of Lecture Notes in Artificial Intelligence, pages 185-200. Springer-Verlag, 1995.

Wie77. R. Wiehagen. Identification of formal languages. In Mathematical Foundations of Computer Science, volume 53 of Lecture Notes in Computer Science, pages 571-579. Springer-Verlag, 1977.

Wri89. K. Wright. Identification of unions of languages drawn from an identifiable class. In R. Rivest, D. Haussler, and M.K. Warmuth, editors, Proceedings of the Second Annual Workshop on Computational Learning Theory, Santa Cruz, California, pages 328-333. Morgan Kaufmann Publishers, Inc., 1989.

ZL95. T. Zeugmann and S. Lange. A guided tour across the boundaries of learning recursive languages. In K. Jantke and S. Lange, editors, Algorithmic Learning for Knowledge-Based Systems, volume 961 of Lecture Notes in Artificial Intelligence, pages 190-258. Springer-Verlag, 1995.

ZLK95. T. Zeugmann, S. Lange, and S. Kapur. Characterizations of monotonic and dual monotonic language learning. Information and Computation, 120:155-173, 1995.




Characteristic Sets for Unions of 
Regular Pattern Languages and Compactness 



Masako Sato, Yasuhito Mukouchi, and Dao Zheng 

Department of Mathematics and Information Sciences 
College of Integrated Arts and Sciences 
Osaka Prefecture University, Sakai, Osaka 599-8531, Japan 
e-mail: {sato, mukouchi}@mi.cias.osakafu-u.ac.jp



Abstract. The paper deals with the class $\mathcal{RP}^k$ of sets of at most $k$ regular patterns. A semantics of a set $P$ of regular patterns is the union $L(P)$ of the languages defined by the patterns in $P$. A set $Q$ of regular patterns is said to be more general than $P$, denoted by $P \sqsubseteq Q$, if for any $p \in P$, there is a pattern $q$ in $Q$ that is more general than $p$. It is known that the syntactic containment $P \sqsubseteq Q$ for sets of regular patterns is efficiently computable. We prove that for any sets $P$ and $Q$ in $\mathcal{RP}^k$, (i) $S_2(P) \subseteq L(Q)$, (ii) the syntactic containment $P \sqsubseteq Q$ and (iii) the semantic containment $L(P) \subseteq L(Q)$ are mutually equivalent, provided $\sharp\Sigma \geq 2k - 1$, where $S_n(P)$ is the set of strings obtained from $P$ by substituting strings of length at most $n$ for each variable. The result means that $S_2(P)$ is a characteristic set of $L(P)$ within the language class for $\mathcal{RP}^k$ under the condition above. Arimura et al. showed that the class $\mathcal{RP}^k$ has compactness with respect to containment if $\sharp\Sigma \geq 2k + 1$. By the equivalence above, we prove that $\mathcal{RP}^k$ has compactness if and only if $\sharp\Sigma \geq 2k - 1$.

The results obtained enable us to design efficient learning algorithms for unions of regular pattern languages, such as those already presented by Arimura et al. under the assumption of compactness.



1 Introduction 

A pattern is a string consisting of constant symbols in a fixed alphabet $\Sigma$ and variables. For example, $p = axbx$ is a pattern, where $a$ and $b$ are constant symbols, and $x$ is a variable. The language $L(p)$ defined by a pattern $p$ is the set of constant strings obtained from the pattern by substituting nonempty constant strings for variables in $p$. For example, the language defined by the above pattern is $L(p) = \{awbw \mid w \in \Sigma^{+}\}$.
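Membership in $L(p)$ for a pattern with repeated variables can be decided by trying every nonempty binding for each variable. The backtracking sketch below does this; representing a pattern as a list of one-character symbols together with an explicit set of variable names is an assumption of the example, as is the function name `in_language`. It can take exponential time in the worst case.

```python
def in_language(pattern, variables, w, binding=None):
    """Decide whether the constant string w belongs to L(pattern).

    pattern   -- list of one-character symbols, e.g. list("axbx")
    variables -- set of symbols that are variables, e.g. {"x"}
    binding   -- current assignment of nonempty strings to variables
    """
    binding = binding or {}
    if not pattern:
        return w == ""
    head, rest = pattern[0], pattern[1:]
    if head not in variables:
        # A constant must match the next character of w literally.
        return w.startswith(head) and in_language(rest, variables, w[1:], binding)
    if head in binding:
        # A repeated variable must be replaced by the same string again.
        value = binding[head]
        return w.startswith(value) and in_language(
            rest, variables, w[len(value):], binding
        )
    # First occurrence: try every nonempty prefix of w as its value.
    return any(
        in_language(rest, variables, w[i:], {**binding, head: w[:i]})
        for i in range(1, len(w) + 1)
    )


p = list("axbx")
print(in_language(p, {"x"}, "aabbab"))  # w = "ab" gives a + ab + b + ab
print(in_language(p, {"x"}, "abba"))    # no single w fits both occurrences
```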

The class $\mathcal{PL}$ of pattern languages was introduced by Angluin as a class inductively inferable from positive data based on identification in the limit due to Gold. The class $\mathcal{PL}$ is one of the most basic classes in the framework of elementary formal systems, which was introduced by Smullyan to develop a new theory of recursive functions, and was proposed as a unifying framework for language learning by Arikawa et al. [3]. That is, an elementary formal system consisting of only one definite clause defines a pattern language. In some practical applications such as genome informatics, pattern languages have received much attention (cf. Arikawa et al.).

Angluin [2] showed that the class $\mathcal{PL}$ has the property of so-called finite thickness. Wright introduced the notion of finite elasticity for a language class, which is a natural extension of finite thickness, and showed that a class with finite elasticity is inferable from positive data and, moreover, that the property is closed under the union operation. As a result, it was shown that for any fixed $k$, the class $\mathcal{PL}^k$ of unions of at most $k$ pattern languages is also inferable from positive data. On the other hand, Sato introduced the notion of finite cross property, which characterizes a class with finite elasticity. The finite cross property is closely related to characteristic sets. A nonempty finite set $S$ of strings is said to be a characteristic set of a language $L$ within a class $\mathcal{L}$ when $L$ is the least language within $\mathcal{L}$ containing the set $S$. We show that a language $L$ has the finite cross property within $\mathcal{L}$ if and only if there is a characteristic set of $L$ within the class $\mathcal{L}$. Thus if a class $\mathcal{L}$ has finite elasticity, then for any language $L \in \mathcal{L}$, there is a characteristic set of $L$ within $\mathcal{L}$.

Let $\mathcal{D}$ be a set of descriptions which can be partially ordered by an effectively computable relation $\sqsubseteq$, and let $\mathcal{L} = \{L(P) \mid P \in \mathcal{D}\}$ be the language class defined by descriptions of $\mathcal{D}$. We assume that the syntactic containment $P \sqsubseteq Q$ implies the semantic containment $L(P) \subseteq L(Q)$ for any $P, Q \in \mathcal{D}$. Assume that the class $\mathcal{L}$ has finite elasticity. Thus for any description $P \in \mathcal{D}$, there is a characteristic set of $L(P)$ within $\mathcal{L}$. If the problem of finding one of the characteristic sets and the membership problem for languages in $\mathcal{L}$ are efficiently computable, then the containment for languages of $\mathcal{L}$ is also efficiently computable. Furthermore, if the semantic containment $L(P) \subseteq L(Q)$ implies the syntactic containment $P \sqsubseteq Q$, the containment for languages of $\mathcal{L}$ is efficiently computable.

A pattern is said to be regular, if each variable in the pattern appears at most once. In this paper, we deal with the class $\mathcal{RP}^k$ of sets of at most $k$ regular patterns as a class of descriptions, and develop the above discussion for $\mathcal{RP}^k$. A pattern $q$ is a generalization of a pattern $p$, denoted by $p \preceq q$, when $p$ is obtained from $q$ by substituting patterns for variables in $q$. For example, a pattern $q = axy$ is a generalization of a pattern $p = axbx$, i.e., $axbx \preceq axy$. The set $\mathcal{P}$ of patterns is a partially ordered set under the relation $\preceq$, provided we identify patterns obtained by renaming variables. Clearly the syntactic containment $p \preceq q$ implies the semantic containment $L(p) \subseteq L(q)$, but the converse does not always hold. Mukouchi showed that the converse is valid for the class $\mathcal{RP}$ of regular patterns.

A set $P = \{p_1, \ldots, p_n\}$ of regular patterns defines a language $L(P) = L(p_1) \cup \cdots \cup L(p_n)$. Let $\mathcal{RPL}^k$ be the class defined by the description class $\mathcal{RP}^k$. For sets $P, Q \in \mathcal{RP}^k$, we define a relation $\sqsubseteq$ as follows: $P \sqsubseteq Q$ if and only if for any $p \in P$, there is a regular pattern $q \in Q$ such that $p \preceq q$. Clearly $P \sqsubseteq Q$ implies $L(P) \subseteq L(Q)$. The relation $\sqsubseteq$ is an efficiently computable partial order when restricted to sets in $\mathcal{RP}^k$ of canonical form (cf. Arimura et al.).

The class $\mathcal{RPL}^k$ as well as the class $\mathcal{PL}^k$ has finite elasticity. Thus for each $P \in \mathcal{RP}^k$, there is a characteristic set of $L(P)$ within $\mathcal{RPL}^k$. Let $S_n(P) = \bigcup_{p \in P} S_n(p)$, where $S_n(p)$ is the set of strings obtained from $p$ by substituting nonempty constant strings of length at most $n$ for each variable. Then there is a positive number $n$ such that $S_n(P)$ is a characteristic set of $L(P)$ within $\mathcal{RPL}^k$. We are interested in the positive number $n$ for given $P \in \mathcal{RP}^k$. We first prove that (i) $S_1(P) \subseteq L(Q)$, (ii) $P \sqsubseteq Q$ and (iii) $L(P) \subseteq L(Q)$ are mutually equivalent, provided $\sharp\Sigma \geq 2k + 1$. The result is not always valid if $\sharp\Sigma \leq 2k$. We show, however, that the above equivalence with (i') $S_2(P) \subseteq L(Q)$ instead of (i) is valid, provided $\sharp\Sigma \geq 2k - 1$. Thus $S_2(P)$ is a characteristic set of $L(P)$ within $\mathcal{RPL}^k$. It is known that the membership problem for regular pattern languages is polynomial time computable (cf. Shinohara [12]), although it is NP-complete for general pattern languages (cf. Angluin). Thus the containment for languages of $\mathcal{RPL}^k$ is efficiently computable.

On the other hand, Arimura et al. gave an efficient learning algorithm for languages in $\mathcal{RPL}^k$ under the condition that the class has compactness with respect to containment. The class $\mathcal{RP}^k$ has compactness with respect to containment if $L(P) \subseteq L(Q)$ implies $P \sqsubseteq Q$ for $P, Q \in \mathcal{RP}^k$. Arimura and Shinohara showed the compactness of $\mathcal{RP}^k$ if $\sharp\Sigma \geq 2k + 1$. In terms of the above equivalence, it can be shown that $\mathcal{RP}^k$ has compactness w.r.t. containment if $\sharp\Sigma \geq 2k - 1$. Moreover, a counter-example is given showing that $\mathcal{RP}^k$ does not have compactness w.r.t. containment if $\sharp\Sigma < 2k - 1$. Consequently, the containment for languages of $\mathcal{RPL}^k$ reduces to that for $\mathcal{RP}^k$, and thus it is efficiently computable.

2 Regular Pattern Languages 

Let $\Sigma$ be a finite set of constant symbols containing at least two symbols, and $X = \{x, y, x_1, x_2, \ldots\}$ be a countable set of variable symbols. We assume $\Sigma \cap X = \emptyset$.

A pattern is a string in $(\Sigma \cup X)^{*}$. Note that we consider the empty string $\varepsilon$ as a pattern, for convenience. By $\mathcal{P}$ we denote the set of all patterns. The length of a pattern $p$, denoted by $|p|$, is just the number of symbols composing it. A substitution $\theta$ is a homomorphism from patterns to patterns that maps every constant to itself. For a pattern $p$ and a substitution $\theta$, we denote by $p\theta$ the image of $p$ by $\theta$. A pattern $q$ is a generalization of a pattern $p$, or $p$ is an instance of $q$, denoted by $p \preceq q$, if there is a substitution $\theta$ such that $p = q\theta$. For two patterns $p$ and $q$, if $p \preceq q$ and $q \preceq p$, then $p$ equals $q$, denoted by $p \equiv q$, except for labeling variables in them. The set $(\mathcal{P}, \preceq)$ constitutes a partially ordered set with respect to $\equiv$.

The language defined by a pattern $p$ is the set $L(p) = \{w \in \Sigma^{*} \mid w \preceq p\}$. Clearly if $p \equiv q$, then $L(p) = L(q)$. A language $L$ over $\Sigma$ is a pattern language, if $L = L(p)$ for some pattern $p$. We denote by $\mathcal{PL}$ the class of all pattern languages.

In this paper, we are especially concerned with a subclass of $\mathcal{PL}$. A pattern $p$ is regular, if each variable appears at most once in $p$. A regular pattern language is a pattern language defined by a regular pattern. We denote by $\mathcal{RP}$ the set of all regular patterns, and by $\mathcal{RPL}$ the set of all regular pattern languages.

Concerning regular patterns, the next fundamental result has been shown:

Lemma 1 (Mukouchi [10]). Let $p$ and $q$ be regular patterns. Then $p \preceq q$ if and only if $L(p) \subseteq L(q)$.

Note that the “if” part of the lemma above is not always valid for general patterns, although the “only if” part is always valid.

By the result above, the containment problem for regular pattern languages is reduced to the decision problem of the partial ordering for regular patterns, which is polynomial time computable (cf. Shinohara [12]).

Now we consider unions of languages defined by patterns. By $\mathcal{P}^{+}$ we denote the class of all nonempty finite subsets of $\mathcal{P}$. For $k \geq 1$, let $\mathcal{P}^k$ be the class of sets consisting of at most $k$ patterns. By $\mathcal{PL}^k$ we denote the class of unions of at most $k$ pattern languages, that is,

$$\mathcal{PL}^k = \{L(P) \mid P \in \mathcal{P}^k\},$$

where $L(P) = \bigcup_{p \in P} L(p)$. In a similar way, we also define $\mathcal{RP}^{+}$, $\mathcal{RP}^k$ and $\mathcal{RPL}^k$, respectively.

For $P, Q \in \mathcal{P}^{+}$, we define the binary relation $P \sqsubseteq Q$ as follows: $P \sqsubseteq Q$ if and only if for any $p \in P$, there is $q \in Q$ such that $p \preceq q$. It is easy to see that $P \sqsubseteq Q$ implies $L(P) \subseteq L(Q)$. However the converse is not valid in general.
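When the more general pattern $q$ is regular, the relation $p \preceq q$ can be tested by a simple dynamic program, since every variable of $q$ occurs once and may absorb any nonempty block of $p$. A minimal sketch follows, assuming non-erasing substitutions and the convention (not from the paper) that uppercase letters are variables and lowercase letters are constants; `is_instance` and `set_contained` are hypothetical names.

```python
from functools import lru_cache


def is_var(symbol):
    """Assumed convention: uppercase letters are variables."""
    return symbol.isupper()


def is_instance(p, q):
    """Decide p ⪯ q for a regular pattern q (each variable occurs once)."""

    @lru_cache(maxsize=None)
    def match(i, j):
        # Does the suffix p[i:] arise from q[j:] by a substitution?
        if j == len(q):
            return i == len(p)
        if i == len(p):
            return False  # every remaining symbol of q needs >= 1 symbol of p
        if is_var(q[j]):
            # The variable q[j] absorbs the nonempty block p[i:i2].
            return any(match(i2, j + 1) for i2 in range(i + 1, len(p) + 1))
        return p[i] == q[j] and match(i + 1, j + 1)

    return match(0, 0)


def set_contained(P, Q):
    """Decide the syntactic containment: every p in P has some q in Q with p ⪯ q."""
    return all(any(is_instance(p, q) for q in Q) for p in P)


print(is_instance("aXbX", "aXY"))                    # the example axbx ⪯ axy
print(set_contained({"ab", "aXb"}, {"aX", "Yb"}))
```

By Lemma 1, for regular patterns `is_instance(p, q)` also answers whether $L(p) \subseteq L(q)$; for sets of patterns, `set_contained` only gives the sufficient syntactic test.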

Definition 2. A class $\mathcal{C} \subseteq \mathcal{P}^{+}$ has compactness with respect to containment, if for any pattern $p \in \mathcal{P}$ and any set $Q \in \mathcal{C}$, $L(p) \subseteq L(Q)$ implies $L(p) \subseteq L(q)$ for some $q \in Q$.

In a similar way, we also define compactness for a class $\mathcal{C} \subseteq \mathcal{RP}^{+}$.

For a class $\mathcal{C} \subseteq \mathcal{RP}^{+}$ with compactness, it is easy to see by Lemma 1 that for any $P, Q \in \mathcal{C}$, $P \sqsubseteq Q$ if and only if $L(P) \subseteq L(Q)$.

In this paper, we show the compactness of the class $\mathcal{RP}^k$ as a corollary of a property stronger than compactness, as follows: For some particular finite subset $S$ of $L(p)$, $S \subseteq L(Q)$ implies $L(p) \subseteq L(q)$ for some $q \in Q$. Note that $S \subseteq L(Q)$ then also implies $L(p) \subseteq L(Q)$. Such a set $S$ is called a characteristic set for $L(p)$, which is defined as follows:

Definition 3. Let $\mathcal{L}$ be a class of languages and $L$ be a language. A set $S \subseteq \Sigma^{*}$ is a characteristic set for $L$ within $\mathcal{L}$, if $S$ is a finite subset of $L$ and for any $L' \in \mathcal{L}$, $S \subseteq L'$ implies $L \subseteq L'$.

If $S$ is a characteristic set for $L \in \mathcal{L}$, then $L$ is the least language in $\mathcal{L}$ containing $S$ in the set-containment ordering, and any finite superset of $S$ contained in $L$ is also a characteristic set for $L$. Furthermore, a finite language $L \in \mathcal{L}$ is a characteristic set for itself.
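Definition 3 can be checked directly when the class is a finite collection of finite languages. The following sketch, with the hypothetical helper `is_characteristic_set` and Python sets standing in for languages, illustrates the remarks above on a three-element chain.

```python
def is_characteristic_set(S, L, family):
    """Check Definition 3 for finite languages represented as Python sets.

    S must be a subset of L, and every language in the family that
    contains S must also contain all of L.
    """
    return S <= L and all(L <= other for other in family if S <= other)


family = [{"a"}, {"a", "b"}, {"a", "b", "c"}]
L = {"a", "b"}

print(is_characteristic_set({"b"}, L, family))  # only supersets of L contain "b"
print(is_characteristic_set({"a"}, L, family))  # {"a"} contains "a" but not L
print(is_characteristic_set(L, L, family))      # a finite language works for itself
```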

The notion of a characteristic set is closely related to that of finite elasticity due to Wright as well as that of finite cross property due to Sato, defined as follows:

Definition 4 (Wright and Motoki et al.). A class $\mathcal{L}$ of languages has finite elasticity, if there does not exist an infinite sequence $(w_i)_{i \geq 0}$ of strings and an infinite sequence $(L_i)_{i \geq 1}$ of languages in $\mathcal{L}$ such that for any $i \geq 1$,

$$\{w_0, \ldots, w_{i-1}\} \subseteq L_i, \text{ but } w_i \notin L_i.$$

A condition for a class to have finite elasticity is characterized by the notion of finite cross property of a language as follows:

Definition 5 (Sato). Let $\mathcal{L}$ be a class of languages. A language $L$ has finite cross property within $\mathcal{L}$, if there does not exist an infinite sequence $(T_i)_{i \geq 1}$ of finite sets of strings and an infinite sequence $(L_i)_{i \geq 1}$ of languages in $\mathcal{L}$ such that (i) $T_1 \subseteq T_2 \subseteq \cdots$, (ii) $\bigcup_{i \geq 1} T_i = L$, (iii) $T_i \subseteq L_i$, but $T_{i+1} \not\subseteq L_i$ ($i \geq 1$).

Lemma 6 (Sato). A class $\mathcal{L}$ of languages has finite elasticity if and only if every language $L$ has finite cross property within $\mathcal{L}$.

Furthermore, by their definitions, it is easy to see that the following lemma is valid:

Lemma 7. Let $\mathcal{L}$ be a class of languages and $L$ be a language. Then $L$ has finite cross property within $\mathcal{L}$ if and only if there is a characteristic set for $L$ within $\mathcal{L}$.

By Lemmas 6 and 7, we see that a class $\mathcal{L}$ has finite elasticity if and only if for any language $L$, there is a characteristic set for $L$ within $\mathcal{L}$. Note that this result has already been shown in Kobayashi and Yokomori.

Wright [13] showed that the class $\mathcal{PL}^k$ has finite elasticity, and so does the subclass $\mathcal{RPL}^k$. Thus, by the lemmas above, we see that for any language $L \in \mathcal{RPL}^k$, there is a characteristic set for $L$ within $\mathcal{RPL}^k$.

Now we define a particular finite subset of a regular pattern language which plays an important role in our paper. For a regular pattern $p$ with just $m$ variables $x_1, \ldots, x_m$ and for $n \geq 1$, we define a finite subset $S_n(p)$ of $L(p)$ as follows: Let $S_n(p)$ be the set of all strings obtained from $p$ by substituting strings in $\Sigma^{+}$ with length at most $n$ for each variable in $p$.

For a nonempty finite set $P$ of regular patterns, we define

$$S_n(P) = \bigcup_{p \in P} S_n(p).$$

Clearly $S_n(P) \subseteq S_{n+1}(P) \subseteq L(P)$ for any $n \geq 1$.
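The sets $S_n(p)$ and $S_n(P)$ are easy to enumerate for regular patterns, because each variable occurs once and can be filled independently. A small sketch, under the assumed convention that uppercase letters are variables and lowercase letters are constants (the names `S` and `S_set` are hypothetical):

```python
from itertools import product


def strings_up_to(alphabet, n):
    """All nonempty strings over the alphabet of length at most n."""
    for length in range(1, n + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def S(p, n, alphabet):
    """S_n(p) for a regular pattern p: fill each variable independently."""
    fillers = list(strings_up_to(alphabet, n))
    choices = [fillers if symbol.isupper() else [symbol] for symbol in p]
    return {"".join(parts) for parts in product(*choices)}


def S_set(P, n, alphabet):
    """S_n(P): the union of S_n(p) over all patterns p in P."""
    return set().union(*(S(p, n, alphabet) for p in P))


print(sorted(S("aXb", 1, "ab")))
print(S_set({"aXb", "X"}, 1, "ab") <= S_set({"aXb", "X"}, 2, "ab"))
```

The independence of the choices is what fails for non-regular patterns, where a repeated variable would have to receive the same string at every occurrence.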

Since a characteristic set for $L(P)$ is a finite set, we have the following theorem:

Theorem 8. For any $P \in \mathcal{RP}^k$, there is $n \geq 1$ such that $S_n(P)$ is a characteristic set for $L(P)$ within $\mathcal{RPL}^k$.

In a later section we will show that 2 is sufficient for the number $n$ in the theorem above, under the assumption that the number of constants is not less than $2k - 1$.

3 $S_1(P)$ as a Characteristic Set

In this section, we will give some simple characteristic set for each language in $\mathcal{RPL}^k$. The key is the set $S_1(p)$ of strings with the shortest length for a regular pattern $p$.

Let $p_1 r p_2 \preceq q$ for regular patterns $p_1, r, p_2$ and $q$, and let $x_1, \ldots, x_n$ be variables appearing in $q$. The subpattern $r$ in $p_1 r p_2$ is generated from $q$ by variable substitution, if there exists a variable $x_i$ in $q$ and a substitution $\theta = \{x_1 := r_1, \ldots, x_i := r' r r'', \ldots, x_n := r_n\}$ such that $p_1 = (q_1\theta) r'$, $p_2 = r'' (q_2\theta)$ for $q = q_1 x_i q_2$. Note that if the pattern $r$ in $p_1 r p_2$ is generated from $q$ by variable substitution, clearly $p_1 x p_2 \preceq q$ holds. In particular, if $p_1 a p_2 \preceq q$ but $p_1 x p_2 \not\preceq q$ for some $a \in \Sigma$, then the constant $a$ in $p_1 a p_2$ is not generated from $q$ by variable substitution, and moreover $q = q_1 a q_2$ holds for some $q_1$ and $q_2$ such that $p_j \preceq q_j$ ($j = 1, 2$). Furthermore, if $p_1 x p_2 \preceq q$, then the variable $x$ in $p_1 x p_2$ is always generated from $q$ by variable substitution.

For a pattern $p$, by head($p$) and tail($p$) we denote the first symbol and the last symbol of $p$, respectively.

We first give two fundamental lemmas useful in this paper. 

Lemma 9. Let $p = p_1 x p_2$ and $q = q_1 q_2 q_3$, where $p, q, p_1, p_2, q_1, q_2$ and $q_3$ are regular patterns and $x$ is a variable. If $p_1 \preceq q_1 q_2$, $p_2 \preceq q_2 q_3$ and $q_2$ contains some variables, then $p \preceq q$ holds.

Proof. Let $y$ be any fixed variable appearing in $q_2$ and $q_2 = q_2' y q_2''$ for some $q_2'$ and $q_2''$. By $p_1 \preceq q_1 (q_2' y q_2'')$, we can put $p_1 = p_1' p_1''$ for some $p_1'$ and $p_1''$ such that $p_1' \preceq q_1 q_2'$ and $p_1'' \preceq y q_2''$. Similarly, by $p_2 \preceq (q_2' y q_2'') q_3$, we can put $p_2 = p_2' p_2''$ for some $p_2'$ and $p_2''$ such that $p_2' \preceq q_2' y$ and $p_2'' \preceq q_2'' q_3$. Now we consider a substitution $\theta = \{y := p_1'' x p_2'\}$. Then we have $p = p_1 x p_2 = p_1' (p_1'' x p_2') p_2'' \preceq q_1 q_2' (p_1'' x p_2') q_2'' q_3 = q\theta \preceq q$. □

By the result above, if $p_1 \preceq q_1 q_2$ and $p_2 \preceq q_2 q_3$ but $p \not\preceq q$, then $q_2$ contains no variable, i.e., $q_2 \in \Sigma^{*}$.

Lemma 10. Suppose > 3. Let p and 9 be regular patterns. Then if p{x := 
a} ^ 9 > p{x ■= b} d q and p{x := c} d q for distinct constants a, b and c, then 
p d q holds. 

Proof. If p does not contain the variable x, then it is clearly true. Thus we 
consider p = p\xp2 for some regular patterns p\ and p2- 

Suppose p q. As mentioned above, if the constant a in p\ap2 is generated 
by variable substitution from 9, then by p\ap2 d q, P d q holds, which is a 
contradiction. Thus we can put 9 = q^^aq^'^ for some q^'^ and qi^^ such that 
Pj d qa'^ (j = 1 , 2 ). Similarly, from p\bp 2 d 9 and p\cp 2 d 9, we can put 
9 = q^^bq^^^ = 9c^^c9c^^ for some q^\q^\qH^^ and 9).^^ such that pj d q^^^ and 
Pj d qc^ U = 1 , 2 ). 

Without loss of generality, we can put $q = q_1 a q_2 b q_3 c q_4$, where

(1) $p_1 \preceq q_a^{(1)} = q_1$, (1') $p_2 \preceq q_a^{(2)} = q_2 b q_3 c q_4$,

(2) $p_1 \preceq q_b^{(1)} = q_1 a q_2$, (2') $p_2 \preceq q_b^{(2)} = q_3 c q_4$,

(3) $p_1 \preceq q_c^{(1)} = q_1 a q_2 b q_3$, (3') $p_2 \preceq q_c^{(2)} = q_4$.

As easily seen, both $q_2$ and $q_3$ contain no variable. In fact, by (2) and (1'),
$p_1 \preceq (q_1 a) q_2$ and $p_2 \preceq q_2 (b q_3 c q_4)$. Thus if $q_2$ contains some variables, it implies
by Lemma 9 that $p \preceq q$, which contradicts the assumption. Therefore $q_2$ contains
no variable. Similarly we can show that $q_3$ contains no variable.

Put $w = q_2$ and $w' = q_3$. By (2) and (3), both $aw$ and $awbw'$ are suffixes of $p_1$.
Therefore if $|w| = |w'|$, then $aw = bw'$ holds, which contradicts the assumption
that $a \ne b$.

Assume $|w| < |w'|$. Then $aw$ is a suffix of $w'$, so $w' = w_1 a w$ for some $w_1 \in \Sigma^*$.
Similarly, by (1') and (2'), both $wbw'c$ and $w'c$ are prefixes of $p_2$. Thus $wb$ is a
prefix of $w'$, so $w' = w b w_2$ for some $w_2 \in \Sigma^*$, and thus $|w_1| = |w_2|$. This implies
$a = c$, because both $wbw'c = w b w_1 a w c$ and $w'c = w b w_2 c$ are prefixes of $p_2$,
which contradicts the assumption that $a \ne c$.

We can also show a contradiction similarly for the case of $|w'| < |w|$. This
completes our proof. □



Theorem 11. Suppose $\sharp\Sigma \ge 2k + 1$. Let $P \in \mathcal{RP}^{+}$ and $Q \in \mathcal{RP}^{k}$. Then the
following three propositions are equivalent:

(i) $S_1(P) \subseteq L(Q)$, (ii) $P \sqsubseteq Q$, (iii) $L(P) \subseteq L(Q)$.

Proof. Clearly (ii) implies (iii) and (iii) implies (i). Now we prove (i) implies (ii).
It suffices to show that for any regular pattern $p$, $S_1(p) \subseteq L(Q)$ implies $p \preceq q$
for some $q \in Q$.

The proof is done by mathematical induction on the number $n$ of variables
in $p$. In case $n = 0$, $p \in L(Q)$, and so $p \in L(q)$ for some $q \in Q$. Let $n \ge 0$
and assume that the claim is valid for any regular pattern with $n$ variables. Let $p$ be a
regular pattern with $(n + 1)$ variables such that $S_1(p) \subseteq L(Q)$, and let $x$ be any
fixed variable in $p$. Put $p_a = p\{x := a\}$ for each $a \in \Sigma$. Note that $p_a$ has just $n$
variables and $S_1(p_a) \subseteq L(Q)$ holds. Thus by the induction hypothesis, $p_a \preceq q_a$
for some $q_a \in Q$. Since $\sharp\Sigma \ge 2k + 1$ and $\sharp Q \le k$, there exists at least one regular
pattern $q \in Q$ such that $p_{a_j} \preceq q$ for some distinct constants $a_j \in \Sigma$ ($j = 1, 2, 3$).
By Lemma 10 it implies $p \preceq q$. □
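
The counting step at the end of this proof is a pigeonhole argument. Spelled out, with $m_q = \sharp\{a \in \Sigma \mid p_a \preceq q\}$ (the name $m_q$ is ours and does not appear in the original text), every constant is assigned to at least one pattern of $Q$, so:

```latex
\[
  \sum_{q \in Q} m_q \;\ge\; \sharp\Sigma \;\ge\; 2k + 1 \;>\; 2k \;\ge\; 2\,\sharp Q
  \quad\Longrightarrow\quad
  \max_{q \in Q} m_q \;\ge\; 3 .
\]
```

If every $m_q$ were at most 2, the sum could not exceed $2\,\sharp Q$; hence some $q$ receives three distinct constants, which is exactly what Lemma 10 needs.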

As a direct corollary of this theorem, we have: 

Corollary 12. Suppose $\sharp\Sigma \ge 3$. Let $p$ and $q$ be regular patterns. Then the
following three propositions are equivalent:

(i) $S_1(p) \subseteq L(q)$, (ii) $p \preceq q$, (iii) $L(p) \subseteq L(q)$.

Note that Theorem 11 is not valid in general if $\sharp\Sigma \le 2k$. Before illustrating
a counter-example, we give the following lemma:
Lemma 13. Suppose $\sharp\Sigma \ge 3$. Let $p$ and $q$ be regular patterns. Then if
$p\{x := a\} \preceq q$ and $p\{x := b\} \preceq q$ for distinct constants $a$ and $b$ but $p \not\preceq q$, then there
exist regular patterns $p_1, p_2, q_1$ and $q_2$ and a string $w \in \Sigma^*$ such that

$p = p_1 A w x w B p_2$, $q = q_1 A w B q_2$, $p_j \preceq q_j$ ($j = 1, 2$),

$p_1 A w \preceq q_1$, $w B p_2 \preceq q_2$,

where $A = a$, $B = b$ or $A = b$, $B = a$.

Proof. Clearly $p$ contains the variable $x$. Let $p = p_1' x p_2'$ for some regular patterns
$p_1'$ and $p_2'$. Similarly to the proof of Lemma 10 we can show that there exist
regular patterns $q_1$ and $q_2$ and a string $w \in \Sigma^*$ such that $q = q_1 A w B q_2$, $p_j' \preceq q_j$
($j = 1, 2$), $p_1' \preceq q_1 A w$ and $p_2' \preceq w B q_2$. Hence we can put $p_1' = p_1 A w$ and
$p_2' = w B p_2$ for some $p_1$ and $p_2$ such that $p_j \preceq q_j$ ($j = 1, 2$). It implies
$p = p_1 A w x w B p_2$. □

By the result above, we can construct the following counter-example for
Theorem 11.

Example 1. Let $\Sigma = \{a_1, \ldots, a_k, b_1, \ldots, b_k\}$ be an alphabet with just $2k$ constants.
We consider a regular pattern $p$ and a set $Q = \{q_1, \ldots, q_k\}$ of regular patterns given
by

$p = x_1 a_1 w_1 x w_1 b_1 x_2$, $q_i = x_1 a_i w_i b_i x_2$ ($i = 1, 2, \ldots, k$),

where $w_1, \ldots, w_k$ are defined recursively as follows:

$w_i = w_{i+1} b_{i+1} a_{i+1} w_{i+1}$ ($i = 1, 2, \ldots, k - 1$), $w_k = \varepsilon$.

For instance, in case $k = 3$, $w_3 = \varepsilon$, $w_2 = b_3 a_3$ and $w_1 = (b_3 a_3) b_2 a_2 (b_3 a_3)$,

$p = x_1 a_1 ((b_3 a_3) b_2 a_2 (b_3 a_3)) x ((b_3 a_3) b_2 a_2 (b_3 a_3)) b_1 x_2$,

$q_1 = x_1 a_1 ((b_3 a_3) b_2 a_2 (b_3 a_3)) b_1 x_2$, $q_2 = x_1 a_2 (b_3 a_3) b_2 x_2$, $q_3 = x_1 a_3 b_3 x_2$.

We will show that $p\{x := a_i\} \preceq q_i$ and $p\{x := b_i\} \preceq q_i$ ($i = 1, 2, \ldots, k$). For
$i = 1$, we have $p\{x := a_1\} = (x_1 a_1 w_1) a_1 (w_1 b_1 x_2) = q_1\{x_1 := x_1 a_1 w_1\} \preceq q_1$ and,
similarly, $p\{x := b_1\} = q_1\{x_2 := w_1 b_1 x_2\} \preceq q_1$.

Next for $i \ge 2$, as easily seen, by the definition of $w_i$, we can put
$w_1 = (w_i b_i) w^{(i)} = w'^{(i)} (a_i w_i)$ for some strings $w^{(i)}$ and $w'^{(i)}$. Thus for each $i \ge 2$,

$p\{x := a_i\} = (x_1 a_1 w_1) a_i (w_1 b_1 x_2) = (x_1 a_1 w_1) a_i (w_i b_i w^{(i)}) b_1 x_2$
$= (x_1 a_1 w_1)(a_i w_i b_i)(w^{(i)} b_1 x_2)$
$= q_i\{x_1 := x_1 a_1 w_1, x_2 := w^{(i)} b_1 x_2\}$
$\preceq q_i$,

and similarly,

$p\{x := b_i\} = q_i\{x_1 := x_1 a_1 w'^{(i)}, x_2 := w_1 b_1 x_2\} \preceq q_i$.

Hence $S_1(p) \subseteq L(Q)$. On the other hand, clearly $p \not\preceq q_i$, and so $L(p) \not\subseteq L(q_i)$
($i = 1, \ldots, k$). □
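
The construction in Example 1 can be checked mechanically for small $k$. The following is a minimal sketch, not part of the original paper: it assumes constants are encoded as Python strings such as "a1" and "b3", strings as tuples of such tokens, and uses the fact that a string belongs to $L(x_1 u x_2)$ exactly when the constant block $u$ occurs in it with at least one symbol on each side. All function names are hypothetical.

```python
from itertools import product

def build_example(k):
    """Return (sigma, left, right, cores) for Example 1 with 2k constants.

    p = x1 left x right x2 and q_i = x1 cores[i] x2, all given as token tuples.
    """
    a = {i: f"a{i}" for i in range(1, k + 1)}
    b = {i: f"b{i}" for i in range(1, k + 1)}
    # w_k is the empty string; w_i = w_{i+1} b_{i+1} a_{i+1} w_{i+1}
    w = {k: ()}
    for i in range(k - 1, 0, -1):
        w[i] = w[i + 1] + (b[i + 1], a[i + 1]) + w[i + 1]
    sigma = list(a.values()) + list(b.values())
    left = (a[1],) + w[1]      # constants between x1 and x in p
    right = w[1] + (b[1],)     # constants between x and x2 in p
    cores = {i: (a[i],) + w[i] + (b[i],) for i in range(1, k + 1)}
    return sigma, left, right, cores

def in_lang_q(s, core):
    """s is in L(x1 core x2): core occurs with >= 1 symbol before and after it."""
    n, m = len(s), len(core)
    return any(s[j:j + m] == core for j in range(1, n - m))

def s1_strings(sigma, left, right):
    """Enumerate S_1(p): replace x1, x and x2 by single constants."""
    for c1, c, c2 in product(sigma, repeat=3):
        yield (c1,) + left + (c,) + right + (c2,)

sigma, left, right, cores = build_example(3)
strings = list(s1_strings(sigma, left, right))
covered = all(any(in_lang_q(s, c) for c in cores.values()) for s in strings)
each_q_misses = all(any(not in_lang_q(s, c) for s in strings)
                    for c in cores.values())
print(covered, each_q_misses)
```

For $k = 3$ the sketch confirms that every string of $S_1(p)$ lies in some $L(q_i)$ while no single $L(q_i)$ contains all of them, in line with the example.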
4 $S_2(P)$ and Compactness

In this section, we show that $S_2(P)$ is a characteristic set for $L(P)$ within $\mathcal{RPL}^k$,
under the condition $\sharp\Sigma \ge 2k - 1$. As a result, the class $\mathcal{RP}^k$ has compactness
with respect to containment.

Lemma 14. Suppose $\sharp\Sigma \ge 3$. Let $p$ and $q$ be regular patterns. Then if
$p\{x := r\} \preceq q$ for any $r \in D$, then $p\{x := xy\} \preceq q$ holds, where $D$ is either one of the
following:

(i) $D = \{ay, by\}$, or $D = \{ya, yb\}$, where $a \ne b$,

(ii) $D = \{a_1 b_1, a_2 b_2, a_3 b_3\}$, where $a_i \ne a_j$ and $b_i \ne b_j$ for $i \ne j$.

Proof. If $p$ does not contain the variable $x$, then it is clearly true. Thus we
consider $p = p_1 x p_2$ for some regular patterns $p_1$ and $p_2$. We prove only for the
case (i) that if $p_1 a y p_2 \preceq q$ and $p_1 b y p_2 \preceq q$, then $p_1 x y p_2 \preceq q$ holds. We can prove
the other cases similarly.

Assume $p_1 x y p_2 \not\preceq q$. Let us put $p_2' = y p_2$ and $p' = p_1 x y p_2 = p_1 x p_2'$. Since
$p'\{x := a\} \preceq q$, $p'\{x := b\} \preceq q$ but $p' \not\preceq q$, by Lemma 13 there exist regular
patterns $p_1'', p_2'', q_1$ and $q_2$ and a string $w \in \Sigma^*$ such that $p_1 = p_1'' A w$, $p_2' = w B p_2''$
and $q = q_1 A w B q_2$, where $\{A, B\} = \{a, b\}$. By $p_2' = w B p_2''$, head($p_2'$) must be a
constant, which contradicts that $p_2' = y p_2$. □

Lemma 15. Suppose $\sharp\Sigma \ge 3$. Let $p$ and $q$ be regular patterns. Then if
$p\{x := a\} \preceq q$ for some $a \in \Sigma$ and $p\{x := xy\} \preceq q$, then $p \preceq q$ holds, where $y$ is a
variable not appearing in $p$.

Proof. If $p$ does not contain the variable $x$, then it is clearly true. Thus we
consider $p = p_1 x p_2$ for some regular patterns $p_1$ and $p_2$.

Assume the converse that $p_1 a p_2 \preceq q$ and $p_1 x y p_2 \preceq q$ but $p_1 x p_2 \not\preceq q$. Similarly
to the proof of Lemma 13 we can show that there exist regular patterns $q_1$ and
$q_2$ and a string $w \in \Sigma^*$ such that

$q = q_1 A w B q_2$, $p_j \preceq q_j$ ($j = 1, 2$), $p_1 \preceq q_1 A w$, $p_2 \preceq w B q_2$,

where $\{A, B\} = \{a, xy\}$.

Let $A = a$ and $B = xy$. By $p_1 \preceq q_1 a w$ and $p_2 \preceq w x (y q_2)$, we can put
$p_1 = p_1' a w$ and $p_2 = w p_2' p_2''$ for some $p_1', p_2'$ and $p_2''$ such that $p_1' \preceq q_1$, $p_2' \preceq x$
and $p_2'' \preceq y q_2$. Consider a substitution $\theta = \{x := x w p_2'\}$. Then
$p_1 x p_2 = p_1' a w x w p_2' p_2'' \preceq (q_1) a w x w p_2' (y q_2) = ((q_1 a w) x (y q_2))\theta = q\theta \preceq q$, which
contradicts the assumption.

Similarly, we can show a contradiction for the case of $A = xy$ and $B = a$. □

For a nonempty finite set $D$ of regular patterns, we denote

head($D$) = {head($p$) | $p \in D$}, tail($D$) = {tail($p$) | $p \in D$}.

Lemma 16. Suppose $\sharp\Sigma \ge 3$. Let $p$ and $q$ be regular patterns. Then if
$p\{x := r\} \preceq q$ for any $r \in D$, then $p \preceq q$ holds, where $D$ is either one of the following:

(i) $D = \{a, b, cy\}$, or $D = \{a, b, yc\}$, where $a \ne b$,

(ii) $D = \{a, by, cy\}$, or $D = \{a, yb, yc\}$, where $b \ne c$.
Proof. If $p$ does not contain the variable $x$, then it is clearly true. Thus we
consider $p = p_1 x p_2$ for some regular patterns $p_1$ and $p_2$.

For the case (ii), by Lemma 14 (i), we have $p_1 x y p_2 \preceq q$. Since $p_1 a p_2 \preceq q$, we
have $p \preceq q$ by Lemma 15.

We prove for $D = \{a, b, cy\}$ of the case (i). Assume the converse that $p_1 a p_2 \preceq q$,
$p_1 b p_2 \preceq q$ and $p_1 c y p_2 \preceq q$ but $p_1 x p_2 \not\preceq q$. Then clearly the constant $a$ in $p_1 a p_2$
is not generated by variable substitution from $q$, and neither are $b$ in $p_1 b p_2$ and $cy$ in
$p_1 c y p_2$. If $p_1 x y p_2 \preceq q$, since $p_1 a p_2 \preceq q$, it follows by Lemma 15 that $p \preceq q$,
a contradiction. Thus we have $p_1 x y p_2 \not\preceq q$. Similarly to the proof of Lemma 10
we can show that there exist regular patterns $q_1$ and $q_2$ and strings $w$ and $w'$
such that

(1) $q = q_1 A w B w' C q_2$, $p_i \preceq q_i$ ($i = 1, 2$),

(2) $p_1 \preceq q_1 A w$, $p_1 \preceq q_1 A w B w'$,

(3) $p_2 \preceq w B w' C q_2$, $p_2 \preceq w' C q_2$,

where $D = \{A, B, C\} = \{a, b, cy\}$. Note that head($D$) = $\{a, b, c\}$ and tail($D$) =
$\{a, b, y\}$ (possibly $c = a$ or $c = b$, but $a \ne b$). Since $Aw$ is a suffix of $AwBw'$
by (2) and $w'C$ is a prefix of $wBw'C$ by (3), it follows that $|w| \ne |w'|$ and
$w, w' \ne \varepsilon$. Assume $|w| < |w'|$. Then there exist strings $w_1, w_2 \in \Sigma^+$ such that
$w' = w_1 w = w w_2$, and so $A$ is a suffix of $A w B w_1$ and $w_2 C$ is a prefix of $B w_1 w C$.

If $A = cy$, then tail($w_1$) = tail($A$) = $y$, which contradicts that $w_1$ is a
constant string.

If $B = cy$, then $w_2 C$ contains the variable $y$, because $B$ is a prefix of $w_2 C$.
In this case, $C = a$ or $b$, and thus $w_2 C$ is a constant string. This is a contradiction.

Finally, if $C = cy$, then $w_2 c y$ is a prefix of the constant string $B w_1 w$, which is a
contradiction.

We can prove the case of $|w| > |w'|$ similarly. □

Now we present the main theorem in this paper. 

Theorem 17. Suppose $k \ge 3$ and $\sharp\Sigma \ge 2k - 1$. Let $P \in \mathcal{RP}^{+}$ and $Q \in \mathcal{RP}^{k}$.
Then the following three propositions are equivalent:

(i) $S_2(P) \subseteq L(Q)$, (ii) $P \sqsubseteq Q$, (iii) $L(P) \subseteq L(Q)$.

Proof. It suffices to show the case of $\sharp Q = k$ and $\sharp\Sigma = 2k - 1$ and that of $\sharp Q = k$
and $\sharp\Sigma = 2k$. Other cases can be reduced to Theorem 11.

We show that for any regular pattern $p$, $S_2(p) \subseteq L(Q)$ implies $p \preceq q$ for some
$q \in Q$, when $\sharp Q = k$ and $\sharp\Sigma = 2k - 1$. The case of $\sharp Q = k$ and $\sharp\Sigma = 2k$ can be
shown similarly. Put $Q = \{q_1, \ldots, q_k\}$.

The proof is done by mathematical induction on the number $n$ of variables
in $p$. In case $n = 0$, $S_2(p) = \{p\}$, and so $p \in L(Q)$. Hence $p \preceq q$ for some $q \in Q$.
Let $n \ge 0$ and assume that the claim is valid for any regular pattern with $n$ variables.
Let $p$ be a regular pattern with $(n + 1)$ variables such that $S_2(p) \subseteq L(Q)$.

Assume $p \not\preceq q_i$ ($i = 1, \ldots, k$). Let $x$ be any fixed variable in $p$ and $p = p_1 x p_2$
for some $p_1$ and $p_2$. For $a, b \in \Sigma$, put $p_a = p\{x := a\}$ and $p_{ab} = p\{x := ab\}$.
Note that both $p_a$ and $p_{ab}$ contain just $n$ variables and that $S_2(p_a) \subseteq L(Q)$ and
$S_2(p_{ab}) \subseteq L(Q)$ hold. By the induction hypothesis, for any $a, b \in \Sigma$, there exist
$i, i' \le k$ such that $p_a \preceq q_i$ and $p_{ab} \preceq q_{i'}$.

For each $i \le k$, put $D_i = \{a \in \Sigma \mid p_a \preceq q_i\}$ and define a bigraph $G_i = (V, E_i)$,
where the set $V$ of vertices consists of two sets $\Sigma$ and $\bar\Sigma = \{\bar a \mid a \in \Sigma\}$ and
the set $E_i$ of edges is defined by $E_i = \{(a, \bar b) \mid p_{ab} \preceq q_i\}$. Note that any cycle
in a bigraph has even length. For each $a, b \in \Sigma$, $\deg_i(a)$ (resp., $\deg_i(\bar b)$) means
the number of $b$'s (resp., $a$'s) such that $p_{ab} \preceq q_i$. We note that, as easily seen,
$\bigcup_{i=1}^{k} D_i = \Sigma$ and $\bigcup_{i=1}^{k} E_i = \Sigma \times \bar\Sigma$ hold.

If $\sharp D_i \ge 3$ for some $i$, then $p \preceq q_i$ by Lemma 10, a contradiction. Thus
$\sharp D_i \le 2$ for any $i \le k$. Moreover, since $\sharp\Sigma = 2k - 1$ and $\bigcup_{i=1}^{k} D_i = \Sigma$, it follows
that $\sharp D_i \ge 1$ for any $i \le k$.

Here we note that for the case of $\sharp D_i = 2$, by Lemma 16 (i), $p_1 a y p_2 \not\preceq q_i$ and
$p_1 y b p_2 \not\preceq q_i$ hold for any $a, b \in \Sigma$. Therefore, by Lemma 10 there exist neither
distinct constants $a_j$ ($j = 1, 2, 3$) nor $b_j$ ($j = 1, 2, 3$) such that $p_{a b_j} \preceq q_i$ and
$p_{a_j b} \preceq q_i$, and thus $\deg_i(a) \le 2$ and $\deg_i(\bar b) \le 2$ for any $a, b \in \Sigma$. These mean
that any connected component of the bigraph $G_i$ is a cycle with even length or
a chain.

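For the case of $\sharp D_i = 1$, by Lemma 16 (ii), the following four cases are
possible:

(1) $p_1 a y p_2 \not\preceq q_i$ and $p_1 y b p_2 \not\preceq q_i$ for any $a, b \in \Sigma$,

(2) $p_1 a y p_2 \not\preceq q_i$ and $p_1 y b p_2 \not\preceq q_i$ for any $a \in \Sigma - \{a_0\}$, $b \in \Sigma - \{b_0\}$,

(3) $p_1 a y p_2 \not\preceq q_i$ and $p_1 y b p_2 \not\preceq q_i$ for any $a \in \Sigma - \{a_0\}$, $b \in \Sigma$,

(4) $p_1 a y p_2 \not\preceq q_i$ and $p_1 y b p_2 \not\preceq q_i$ for any $a \in \Sigma$, $b \in \Sigma - \{b_0\}$,

where $a_0, b_0 \in \Sigma$ are some constant symbols.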

For the case (1), any connected component of $G_i$ is similarly shown to be
a cycle with even length or a chain. For the case (2), we consider a bigraph
$G_i'$ obtained from the bigraph $G_i$ by deleting the vertices $a_0$ and $\bar b_0$. Clearly
$\deg_i(a) \le 2$ and $\deg_i(\bar b) \le 2$ in the bigraph $G_i'$ for any $a \in \Sigma - \{a_0\}$ and any
$b \in \Sigma - \{b_0\}$. Thus any connected component of the bigraph $G_i'$ is a cycle with even
length or a chain. For the cases (3) and (4), we can similarly get subbigraphs of
$G_i$ whose connected components are all cycles with even lengths or chains.

Hereafter, we prove the following claim: 

Claim: For some $i_0 \le k$, the bigraph $G_{i_0}$ contains at least three edges $(a_j, \bar b_j)$,
$j = 1, 2, 3$, such that $a_j \ne a_{j'}$ and $b_j \ne b_{j'}$ for $j \ne j'$.

Proof of the claim. Since $\sharp\Sigma = 2k - 1$ and $\bigcup_{i=1}^{k} D_i = \Sigma$, it follows that $\sharp D_i = 2$
for at least $(k - 1)$ $i$'s, say, $1, 2, \ldots, k - 1$, and $\sharp D_k = 1$ or $2$.

(i) In case $\sharp D_i = 2$ for any $i \le k$. As noted above, for any $i \le k$, all
connected components of the bigraph $G_i$ are cycles with even lengths and chains.
As mentioned above, for any $a, b \in \Sigma$, there exists $i \le k$ such that $p_{ab} \preceq q_i$, and
thus for any edge $(a, \bar b)$, there exists $i \le k$ such that $(a, \bar b) \in E_i$. Since there are
$(2k - 1)^2$ possible edges, there exists $i_0 \le k$ such that the bigraph $G_{i_0}$ contains
at least $(4k - 3)$ ($\ge (2k - 1)^2 / k$) edges. Since $\deg_{i_0}(a) \le 2$ for any $a \in \Sigma$,
it means that $\deg_{i_0}(a) = 2$ for at least $(2k - 2)$ $a$'s, and $\deg_{i_0}(a) \le 1$ for at
most one $a$. Hence the bigraph $G_{i_0}$ consists of some cycles with even lengths
and at most one chain. Note that a cycle with length $2l$ contains $l$ distinct edges
which are not mutually adjacent. As easily seen, $G_{i_0}$ contains a set of edges
$(a_j, \bar b_j)$, $j = 1, \ldots, 2k - 1$, which are not mutually adjacent. Since $k \ge 3$, we have
$2k - 1 \ge 5$, and thus our claim is valid.
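
The edge count used in case (i) can be checked directly. The following worked arithmetic is our own expansion of that step, using only the quantities named above:

```latex
\[
  \frac{(2k-1)^2}{k} \;=\; 4k - 4 + \frac{1}{k} \;>\; 4k - 4
  \quad\Longrightarrow\quad
  \sharp E_{i_0} \;\ge\; 4k - 3 \ \text{ for some } i_0 \le k,
\]
\[
  \sum_{a \in \Sigma} \bigl(2 - \deg_{i_0}(a)\bigr)
  \;=\; 2(2k-1) - \sharp E_{i_0}
  \;\le\; (4k - 2) - (4k - 3) \;=\; 1 .
\]
```

Since each term $2 - \deg_{i_0}(a)$ is a nonnegative integer, at most one constant $a$ has degree below 2, which gives the stated bound of at least $(2k - 2)$ vertices of degree exactly 2.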

(ii) In case $\sharp D_i = 2$ for any $i < k$ and $\sharp D_k = 1$. For the case (1) mentioned
above, we can show the claim similarly to (i). Let us consider the case (2) above. For any
$i \le k$, let $G_i'$ be a subbigraph obtained from $G_i$ by deleting the vertices $a_0$ and $\bar b_0$.
Then for any $i \le k$, any connected component of the bigraph $G_i'$ is a cycle with
even length or a chain. Similarly to (i), since there are $(2k - 2)^2$ possible edges
$(a, \bar b)$ each of which is contained in at least one bigraph considered, there exists
$i_0 \le k$ such that the bigraph $G_{i_0}'$ contains at least $4k - 7$ ($\ge (2k - 2)^2 / k$) edges.
It implies that $\deg_{i_0}(a) = 2$ for at least $(2k - 5)$ $a$'s and $\deg_{i_0}(a) \le 1$ for at most
three $a$'s. In particular, if $\deg_{i_0}(a) = 2$ for just $(2k - 5)$ $a$'s, $\deg_{i_0}(a) = 1$ for the
other three $a$'s. Moreover, if $\deg_{i_0}(a) = 2$ for just $(2k - 4)$ $a$'s, $\deg_{i_0}(a) = 1$ for
at least one $a$. In any case, $G_{i_0}$ contains a set of edges $(a_j, \bar b_j)$, $j = 1, \ldots, 2k - 3$,
which are not mutually adjacent. Since $k \ge 3$, we have $2k - 3 \ge 3$, and thus our
claim is valid.

We can prove our claim for the cases (3) and (4) similarly. ■ 

Appealing to Lemma 14 (ii), we have $p_1 x y p_2 \preceq q_{i_0}$, and thus $p \preceq q_{i_0}$ by Lemma
15. This contradicts our assumption. □

As a direct corollary of this theorem, we have: 

Corollary 18. Suppose $k \ge 3$ and $\sharp\Sigma \ge 2k - 1$. Let $P \in \mathcal{RP}^{k}$. Then $S_2(P)$ is
a characteristic set for the language $L(P)$ within $\mathcal{RPL}^k$.

Lemma 19. If $\sharp\Sigma \le 2k - 2$, then the class $\mathcal{RP}^k$ does not have compactness
with respect to containment.

Proof. Let $\Sigma = \{a_1, \ldots, a_{k-1}, b_1, \ldots, b_{k-1}\}$ be an alphabet with just $(2k - 2)$
constants. Let $p$, $q_i$ and $w_i$ ($i = 1, \ldots, k - 1$) be regular patterns and strings
defined in Example 1, where $w_{k-1} = \varepsilon$. Then let $q_k = x_1 a_1 w_1 x y w_1 b_1 x_2$.

As shown in Example 1, $p\{x := a_i\} \preceq q_i$ and $p\{x := b_i\} \preceq q_i$ for
$i = 1, 2, \ldots, k - 1$, and thus $S_1(p) \subseteq \bigcup_{i=1}^{k-1} L(q_i)$. On the other hand, clearly, for any
string $w$ with $|w| \ge 2$, $p\{x := w\} \preceq q_k$. These mean $L(p) \subseteq L(Q)$. However,
clearly, $p \not\preceq q_i$, and so $L(p) \not\subseteq L(q_i)$ ($i = 1, 2, \ldots, k$). Therefore $\mathcal{RP}^k$ does not
have compactness w.r.t. containment. □

By Theorem 17 and Lemma 19, we have the following theorem:

Theorem 20. Suppose $k \ge 3$. Then the class $\mathcal{RP}^k$ has compactness with respect
to containment if and only if $\sharp\Sigma \ge 2k - 1$.

Note that, independent of ours, Arimura and Shinohara[S| showed that if 
$\sharp\Sigma \ge 2k + 1$, then the class $\mathcal{RP}^k$ has compactness w.r.t. containment, and that it
does not if $\sharp\Sigma = k + 1$. Theorem 20 completely fills the gap on the number of constant
symbols.

The following example is a counter-example for Theorem 17 in case $k = 2$.

Example 2. Let $\Sigma = \{a, b, c\}$ be an alphabet with just 3 constants. We consider
regular patterns $p$, $q_1$ and $q_2$ given by

$p = x_1 a x b x_2$, $q_1 = x_1 a b x_2$, $q_2 = x_1 c x_2$.

For any $w \in \Sigma^+$, if $c$ appears in $w$, then $p\{x := w\} \preceq q_2$ holds. Otherwise,
as easily seen, $p\{x := w\} \preceq q_1$ holds. It implies that $L(p) \subseteq L(q_1) \cup L(q_2)$, but
clearly $p \not\preceq q_i$ ($i = 1, 2$). □
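
Example 2 is small enough to confirm by exhaustive search over short substitutions. The sketch below is ours, not the authors': it encodes constants as single characters and relies on the observation that a string is in $L(x_1 u x_2)$ exactly when the constant block $u$ occurs strictly inside it. The bound `max_sub` on substitution length is an arbitrary assumption made to keep the enumeration finite.

```python
from itertools import product

SIGMA = "abc"

def words(max_len):
    """All nonempty strings over SIGMA of length at most max_len."""
    for n in range(1, max_len + 1):
        for t in product(SIGMA, repeat=n):
            yield "".join(t)

def in_q1(s):
    """s in L(x1 a b x2): 'ab' occurs with a symbol on each side."""
    return "ab" in s[1:-1]

def in_q2(s):
    """s in L(x1 c x2): 'c' occurs at a non-boundary position."""
    return "c" in s[1:-1]

def lang_p(max_sub):
    """Strings u a w b v of L(x1 a x b x2) with nonempty u, w, v of bounded length."""
    subs = list(words(max_sub))
    return {u + "a" + w + "b" + v for u in subs for w in subs for v in subs}

sample = lang_p(3)
union_covers = all(in_q1(s) or in_q2(s) for s in sample)
q1_alone_fails = any(not in_q1(s) for s in sample)   # e.g. "aacba"
q2_alone_fails = any(not in_q2(s) for s in sample)   # e.g. "aaabb"
print(union_covers, q1_alone_fails, q2_alone_fails)
```

The search only samples $L(p)$, so it illustrates rather than proves the inclusion; the argument in the example covers all substitutions.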

Thus Theorem 17 is not always valid for $k = 2$. However, we obtain the
following result for the case of $k = 2$.

Theorem 21. Suppose $\sharp\Sigma \ge 4$. Let $P \in \mathcal{RP}^{+}$ and $Q \in \mathcal{RP}^{2}$. Then the
following three propositions are equivalent:

(i) $S_2(P) \subseteq L(Q)$, (ii) $P \sqsubseteq Q$, (iii) $L(P) \subseteq L(Q)$.

Proof. We can prove it similarly to the proof of Theorem 17. □

As direct corollaries of this theorem, we have: 

Corollary 22. Suppose $\sharp\Sigma \ge 4$. Let $P \in \mathcal{RP}^{2}$. Then $S_2(P)$ is a characteristic
set for the language $L(P)$ within $\mathcal{RPL}^2$.

Corollary 23. Suppose $\sharp\Sigma \ge 4$. Then the class $\mathcal{RP}^2$ has compactness with
respect to containment.

The following corollary would be very useful in the theory of inductive infer- 
ence of recursive languages from positive data from the view point of efficiency: 

Corollary 24. Suppose $k \ge 3$ and $\sharp\Sigma \ge 2k - 1$. Let $P \in \mathcal{RP}^{+}$ and $Q \in \mathcal{RP}^{k}$.
Then for any subset $S$ of $L(Q)$, if $S_2(P) \subseteq S$, then $P \sqsubseteq Q$ holds.

Furthermore, because for any regular patterns $p$ and $q$, whether or not
$p \preceq q$ holds is computable in time polynomial in the sum of the lengths of $p$ and $q$ (cf.
Shinohara [12]), it follows that the containment problem for unions is efficiently
computable, that is, we have the following corollary:

Corollary 25. Suppose $k \ge 3$ and $\sharp\Sigma \ge 2k - 1$. For any $P \in \mathcal{RP}^{+}$ and
$Q \in \mathcal{RP}^{k}$, whether or not $L(P) \subseteq L(Q)$ is computable in time polynomial in the total
length of the patterns appearing in $P$ and $Q$.
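
To make the syntactic test behind Corollary 25 concrete, here is a minimal sketch of deciding $p \preceq q$ for a regular pattern $q$ under non-erasing substitutions, followed by the test $P \sqsubseteq Q$. It is our own greedy left-to-right matcher, not the algorithm of the cited reference. The encoding is an assumption made for illustration: patterns are lists of tokens, and a token is treated as a variable exactly when it starts with "x". All function names are hypothetical.

```python
def is_var(tok):
    # Assumed encoding: variables are tokens such as "x", "x1", "x2".
    return tok.startswith("x")

def const_blocks(q):
    """Split q = w0 y1 w1 ... ym wm into its constant blocks [w0, ..., wm]."""
    blocks = [[]]
    for tok in q:
        if is_var(tok):
            blocks.append([])
        else:
            blocks[-1].append(tok)
    return blocks

def leq(p, q):
    """True iff p is obtained from the regular pattern q by replacing each
    variable of q with a nonempty pattern."""
    p = list(p)
    blocks = const_blocks(q)
    if len(blocks) == 1:                  # q contains no variable
        return p == list(q)
    first, last = blocks[0], blocks[-1]
    if p[:len(first)] != first:
        return False
    pos = len(first)                      # end of the part matched so far
    end = len(p) - len(last)              # where the final block must start
    for w in blocks[1:-1]:
        j = pos + 1                       # leave >= 1 symbol for the variable
        while j + len(w) <= end and p[j:j + len(w)] != w:
            j += 1                        # leftmost occurrence is always safe
        if j + len(w) > end:
            return False
        pos = j + len(w)
    return end >= pos + 1 and p[end:] == last

def subsumed(P, Q):
    """P is subsumed by Q: every p in P satisfies p <= q for some q in Q."""
    return all(any(leq(p, q) for q in Q) for p in P)

# Patterns of Example 2.
p = ["x1", "a", "x", "b", "x2"]
q1 = ["x1", "a", "b", "x2"]
q2 = ["x1", "c", "x2"]
print(leq(p, q1), leq(p, q2), leq(["x1", "a", "a", "b", "x2"], q1))
```

On the patterns of Example 2 the test `subsumed([p], [q1, q2])` fails even though $L(p) \subseteq L(q_1) \cup L(q_2)$ holds there; this is consistent with the text, since the equivalence with language containment is only claimed under the hypotheses of Theorems 17 and 21.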
Abstract. The present paper deals with the problem of finding a consistent
one-variable pattern from incomplete positive and negative examples.
The studied problems are called an extension E, a consistent extension
CE and a robust extension RE, respectively. Problem E corresponds
to the ordinary problem to decide whether there exists a one-variable
pattern that is consistent with the given positive and negative examples.
As for the other problems, an example string is allowed to contain some
unsettled symbols that can potentially match every constant symbol.
For the problem CE, one has to decide whether there exists a suitable
assignment for these unsettled symbols as well as a one-variable pattern
consistent with the examples with respect to the assignment chosen.
Problem RE is the universal version of problem CE, i.e., now one has to
decide whether there exists a one-variable pattern that is consistent with
the examples under every assignment for the unsettled symbols.

The decision problems defined are closely connected to the learnability
of one-variable pattern languages from positive and negative examples.
The computational complexity of the decision problems defined above is
studied. In particular, it is shown that RE is NP-complete.



1 Introduction 

A pattern is a non-null string over $\Sigma \cup X$, where $\Sigma$ is a finite alphabet of
constants and $X = \{x_0, x_1, \ldots\}$ a countably infinite alphabet of variables. For
example, $a x_0 x_1 x_0 b x_0$ and $b b x_0 a a x_0 x_1 x_1$ are patterns over $\{a, b\} \cup X$. A pattern $\pi$
is said to be a $k$-variable pattern if at most $k$ different $x_i \in X$ appear in $\pi$.
The language $L(\pi)$ generated by a pattern $\pi$ is the set of all constant strings
obtained by substituting nonempty strings for the variables of tt (cf. P^)- order 
to motivate our research, we shortly recall results concerning the learnability of 
pattern languages and their relation to decision and constructibility problems. 
For any background concerning the definitions and properties of relevant learning 
models we refer the interested reader to 

A string $w$ is called a positive example of a pattern $\pi$ if $w \in L(\pi)$. Furthermore,
every infinite sequence of positive examples eventually exhausting $L(\pi)$ is
said to be a positive presentation for $L(\pi)$. The set of all pattern languages is an
important and prominent language family that can be learned in the limit from 
positive presentations. The first result in this regard goes back to Angluin Q 
who showed that any algorithm computing a descriptive pattern on input a finite
sample of positive examples can be transformed into a learner for the whole
class of pattern languages from positive presentations. Here, a pattern $\pi$ is called
descriptive for a finite set $S$ of strings, if $S \subseteq L(\pi)$ and for any other pattern
$\pi'$ such that $S \subseteq L(\pi')$, $L(\pi') \not\subset L(\pi)$ (cf. Q). However, no polynomial-time

algorithm computing descriptive patterns is known, and finding a descriptive 
pattern of maximum possible length is known to be NP-complete (cf. Q). 

As for the special case of one-variable patterns, descriptive patterns can be 
computed in time 0(n^ log n), where n is the size of the input sample (cf. ^). 
Nevertheless, the resulting learner is not optimal with respect to the expected 
overall time taken until convergence (cf. [ I I j ) . Giving up the idea to find descrip- 
tive one-variable patterns at all enabled Reischuk and Zeugmann im to design 
a learner achieving linear total learning time, that is 0 (| 7 t |), with probability 1 
for a huge class of probability distributions. 

Moreover, Angluin £^lso proposed to study the complexity of the inclusion 
problem for pattern languages, i.e., given any patterns $\pi$ and $\tau$, to decide whether
$L(\pi) \subseteq L(\tau)$. Recently, Jiang et al. 0 have shown the inclusion problem to be
undecidable, and this has also negative consequences concerning the learnability 
of all pattern languages from positive presentations under certain monotonicity 
constraints (cf. j I ti) 1 . 

Next, we turn our attention to problems more closely related to the research 
presented in this paper. A string $w$ is called a negative example of a pattern $\pi$
provided $w \notin L(\pi)$. Given two finite sets $\mathcal{P}$ and $\mathcal{N}$, the consistency problem is
to decide whether there is a pattern $\pi$ such that $\mathcal{P} \subseteq L(\pi)$ and $\mathcal{N} \cap L(\pi) = \emptyset$.
This problem is sometimes also referred to as separability. Let SEP denote the 
separability problem and CSEP the problem to construct a separating pattern. 
Wiehagen and Zeugmann jTJ showed CSEP to be NP-hard. Additionally, the 
latter result implies that the class of all pattern languages is not inferable from 
positive and negative data by a learner achieving both polynomial update time 
and outputting exclusively consistent hypotheses, unless P = NP (cf. El)- Fur- 
thermore, their result also sharpens the previously known fact that Gold’s |Bj 
identification by enumeration principle cannot be used for computing a separa- 
ting pattern in polynomial time, since Angluin Q has proved the membership 
problem for the pattern languages to be NP-complete. Here, the membership 
problem is defined as follows. Given as input a string w and a pattern tt, decide 
whether w G L{7 t). 

Note that CSEP and SEP are also closely related to the PAC-learnability
of all pattern languages. If CSEP were in P, then the class of pattern languages
would be PAC learnable with respect to the hypothesis space consisting
of all pattern languages. On the other hand, since the membership problem for
patterns is NP-complete, the set of all patterns does not constitute a polynomial
time evaluable hypothesis space as usually required in PAC learning. But
the pattern languages are also known to be not PAC learnable for any such
polynomial time evaluable hypothesis space, unless P/poly = NP/poly (cf. [T^l.
On the other hand, the pattern languages have already found highly nontrivial
applications for real world data sets (cf. jSj). Thus, understanding what
makes the decision and constructibility problems outlined above hard, is of major 
importance. We continue along this line of research by looking at the class of one- 
variable patterns. While it has been known that membership and inclusion are 
decidable in polynomial time, the complexity of the consistency problem remains 
open. We attack this problem by adapting the notion of monotone extensions 
introduced in jS] in the setting of learning Boolean concepts. 

Besides the motivation given in Boros et al. ^ it may also be helpful to 
look at the problems extension, consistent extension, and robust extension from 
the point of view how noisy data may influence the complexity of learning. 
These problems are defined as follows. Given a set of positive data $\mathcal{P}$ and a
set of negative data $\mathcal{N}$, a one-variable pattern $\pi$ such that $\mathcal{P} \subseteq L(\pi)$ and
$\mathcal{N} \cap L(\pi) = \emptyset$ is called an extension. The problem of deciding whether there exists
an extension for the data given is denoted by E, i.e., E is just the consistency
problem for one-variable patterns. Since real world data may be noisy, e.g., by 
containing indefinite values it is only natural to look at the following versions of 
E. Allowing strings to contain indefinite values can be modeled by introducing a 
wild card * as a placeholder. Thus, now we are given strings over the alphabet of 
constants plus *. There are two kinds of interpretations. One is that we consider 
an establishment of the indefinite values to be critical for our hypothesis. In this 
case, we must settle all the indefinite values such that there exists an extension 
with respect to the settlement. The other is that the value does not influence our 
hypothesis. In this case, our requirement is an extension that is consistent with 
the data with respect to every settlement. The resulting decision problems are 
called consistent extension CE and robust extension RE, respectively. Moreover,
allowing indefinite values directly yields two natural versions of the membership 
problem. We study the complexity of all these problems. In particular, we show 
that RE is NP-complete. 



2 Preliminaries 

For each finite set $S$, the cardinality of $S$ is denoted by $\|S\|$. An alphabet is a
finite set of symbols, denoted by $\Sigma$. The free monoid over $\Sigma$ is denoted by $\Sigma^*$,
and the set of all nonempty strings is denoted by $\Sigma^+$, where $\Sigma^* = \Sigma^+ \cup \{\varepsilon\}$,
and $\varepsilon$ is the empty string (cf., e.g., 0). In particular, for each $n \ge 1$, we denote
the set of all strings of length $n$ over $\Sigma$ by $\Sigma^n$.

Let $w$ be a string of the form $\alpha\beta\gamma$. Then $\alpha$, $\beta$, and $\gamma$ are called substrings
of $w$. A substring of $w$ is said to be a proper substring if it is not equal to $w$.
Furthermore, we refer to $\alpha$ and $\gamma$ as a prefix and a suffix of $w$, respectively.
The string $\alpha$ is called a proper prefix if $\alpha \ne w$, and a proper suffix is defined
similarly. Let $\alpha$ and $\beta$ be strings. By $\sharp(\alpha, \beta)$, we denote the number of occurrences
of $\beta$ in $\alpha$. The length of a string $\alpha$ is denoted by $|\alpha|$. The $i$-th symbol of $\alpha$ from
the left is denoted by $\alpha[i]$.
Let $x$ be a symbol not belonging to $\Sigma$. Every $\pi \in (\Sigma \cup \{x\})^+$ is called a
one-variable pattern and $x$ is referred to as the pattern variable of $\pi$. If $\sharp(\pi, x) \ge 1$,
then the pattern $\pi$ is called proper. The set of all one-variable patterns is denoted
by Pat. For any $\pi \in$ Pat and $u \in \Sigma^+$, the expression $\pi[x/u]$ denotes the string
$w \in \Sigma^+$ which is obtained by replacing all occurrences of $x$ in $\pi$ by the string $u$.
The string $u$ is called a substitution¹ for $x$. For every $\pi \in$ Pat, we define the
language of $\pi$ by

$L(\pi) =_{df} \{w \in \Sigma^+ \mid w = \pi[x/u], u \in \Sigma^+\}$.

Let $*$ be a special symbol not belonging to $\Sigma \cup \{x\}$. A string $w \in (\Sigma \cup \{*\})^+$
is called incomplete if $\sharp(w, *) \ge 1$. Let $\mathcal{P}$ and $\mathcal{N}$ be finite subsets of $(\Sigma \cup \{*\})^+$
such that $\mathcal{P} \cap \mathcal{N} = \emptyset$. A member of $\mathcal{P}$ is called a positive example and a member
of $\mathcal{N}$ is said to be a negative example. For an alphabet $\Sigma$, $\Phi_\Sigma$ denotes the
set of all partial functions $(\Sigma \cup \{*\})^+ \to \Sigma^+$ such that $\varphi \in \Phi_\Sigma$ iff for each
$w \in (\Sigma \cup \{*\})^+$, if $\sharp(w, *) = 0$, then $\varphi(w) = w$, and if $\sharp(w, *) \ge 1$, then $\varphi(w)$ is
a string $w' \in \Sigma^+$ such that for any $1 \le i \le |w|$ with $w[i] \ne *$, $w'[i] = w[i]$.

Next, we define the decision problems mainly studied in this paper. Let $\mathcal{P}$, $\mathcal{N}$
be finite sets of positive examples and of negative examples, respectively, such
that $\mathcal{P}, \mathcal{N} \subseteq \Sigma^+$. A pattern $\pi \in$ Pat is said to be consistent with $\mathcal{P} \cup \mathcal{N}$
if $\mathcal{P} \cup \mathcal{N} \subseteq \Sigma^+$, $\mathcal{P} \subseteq L(\pi)$ and $\mathcal{N} \cap L(\pi) = \emptyset$. Thus, if there is a consistent
pattern for $\mathcal{P} \cup \mathcal{N}$, then such a pattern can be thought of as explaining the data
given. Moreover, since every pattern language generated by a proper pattern is
infinite, a consistent pattern for $\mathcal{P} \cup \mathcal{N}$ can also be regarded as a generalization or
extension of the data sets $\mathcal{P}$ and $\mathcal{N}$. Following Boros et al. P], we therefore
call a consistent pattern an extension.

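Now, let $\mathcal{P}, \mathcal{N} \subseteq (\Sigma \cup \{*\})^+$ be any finite sets. We say that there is a
consistent extension for $\mathcal{P} \cup \mathcal{N}$ iff there are a $\varphi \in \Phi_\Sigma$ and a pattern $\pi \in$ Pat
such that $\pi$ is consistent with $\varphi(\mathcal{P}) \cup \varphi(\mathcal{N})$. Finally, there is a robust extension
for $\mathcal{P} \cup \mathcal{N}$ provided there exists a pattern $\pi \in$ Pat such that $\pi$ is consistent with
$\varphi(\mathcal{P}) \cup \varphi(\mathcal{N})$ for all $\varphi \in \Phi_\Sigma$.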

Definition 1. The decision problem extension, denoted by E, is the problem to
decide, on input any sets $\mathcal{P}, \mathcal{N} \subseteq \Sigma^+$, whether there exists an extension for
$\mathcal{P} \cup \mathcal{N}$.

Consistent extension (abbr. CE) is the problem to decide, on input any sets
$\mathcal{P}, \mathcal{N} \subseteq (\Sigma \cup \{*\})^+$, whether there exists a consistent extension for $\mathcal{P} \cup \mathcal{N}$.

Finally, robust extension (abbr. RE) is the problem to decide, on input any
sets $\mathcal{P}, \mathcal{N} \subseteq (\Sigma \cup \{*\})^+$, whether there exists a robust extension for $\mathcal{P} \cup \mathcal{N}$.

Clearly, the complexity of these problems is measured in the length of the input. 

Next, we exemplify these problems. 

¹ In this study, no erasing is assumed, i.e., any possible substitution is not $\varepsilon$.

Example 1. Let $\Sigma = \{a, b, c\}$, $\mathcal{P}$ = {abacababacacabaca, abacabaca} and
$\mathcal{N}$ = {aabaacabaca}. Then, there are three patterns $\pi_1 = x$, $\pi_2 = xbxcabaca$ and
$\pi_3 = abacabxcx$ that are consistent with $\mathcal{P}$, but $\pi_1$ and $\pi_2$ are not consistent
with $\mathcal{N}$ because $\pi_1[x/aabaacabaca] = \pi_2[x/aa] = aabaacabaca$. On the other
hand, $\pi_3$ is consistent with $\mathcal{P} \cup \mathcal{N}$, and thus, there exists an extension of $\mathcal{P} \cup \mathcal{N}$.

Next, let $\mathcal{P}$ = {abacababacacabaca, abacabaca} and let $\mathcal{N}$ = {*abb*cabaca}.
Then, the answer to CE is "yes," since for $\varphi$(*abb*cabaca) = aabbacabaca the
pattern $\pi_2 = xbxcabaca$ is consistent with $\mathcal{P}$ and $\varphi(\mathcal{N})$.

Finally, let $\mathcal{P}$ = {a*b, ab*aba*cb} and $\mathcal{N}$ = {bb*cba}. Then, the pattern $\pi = axb$
is consistent with $\varphi(\mathcal{P})$ and $\varphi(\mathcal{N})$ for every $\varphi \in \Phi_\Sigma$. Thus, in this case, there
exists a robust extension.
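
The claims of this example can be replayed by brute force. The following sketch is an illustration under our own assumptions, not the paper's method: patterns and strings are Python strings, the pattern variable is the character "x", the wild card is "*", and robustness is tested by enumerating every assignment of constants to the wild cards, which is exponential in their number. The function names are hypothetical.

```python
from itertools import product

def member(pattern, s, var="x"):
    """Decide s in L(pattern) for a one-variable pattern (non-erasing)."""
    n = pattern.count(var)
    if n == 0:
        return pattern == s
    m = len(s) - (len(pattern) - n)      # symbols left for the n copies of u
    if m <= 0 or m % n != 0:
        return False
    k = m // n                           # forced length of the substitution u
    start = pattern.index(var)           # u is read off at the first x
    u = s[start:start + k]
    return pattern.replace(var, u) == s

def assignments(w, sigma, wild="*"):
    """All strings obtained from w by settling every wild card."""
    slots = [i for i, c in enumerate(w) if c == wild]
    for fill in product(sigma, repeat=len(slots)):
        chars = list(w)
        for i, c in zip(slots, fill):
            chars[i] = c
        yield "".join(chars)

def is_robust(pattern, pos, neg, sigma):
    """pattern is consistent with the data under every assignment."""
    ok_pos = all(member(pattern, s) for w in pos for s in assignments(w, sigma))
    ok_neg = not any(member(pattern, s) for w in neg for s in assignments(w, sigma))
    return ok_pos and ok_neg

print(member("xbxcabaca", "aabaacabaca"),
      member("abacabxcx", "aabaacabaca"),
      is_robust("axb", ["a*b", "ab*aba*cb"], ["bb*cba"], "abc"))
```

Because distinct strings are settled independently by a function in $\Phi_\Sigma$, checking every assignment per string is equivalent to quantifying over all such functions.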

In this study, we generally assume that $\|\Sigma\| \ge 2$ because of the following
reasons. Let $\Sigma$ consist of one symbol, say $\Sigma = \{a\}$. Then, there exist trivial
reductions from CE to E and from RE to E by replacing all $*$ in the given strings by the
same symbol $a$. For each $w \in \Sigma^+$ let $L_w(i, j) = \{\pi \in$ Pat $\mid \sharp(\pi, a) = i, \sharp(\pi, x) =
j, \pi[x/u] = w, |u| = (|w| - i)/j\}$. Since $\|\Sigma\| = 1$, for any $w, w' \in \Sigma^+$ and $1 \le
i, j \le \min\{|w|, |w'|\}$, it holds that L^(iJ) = iff yf 0.

Thus, if $\|\Sigma\| = 1$, then the problems E, CE and RE are decidable in time polynomial
in the sum of the lengths of the given strings.

3 Comparing the Difficulty of E and CE 

In this section, we study the complexity of problems E and CE. First of all, we 
would like to establish an upper bound for the complexity of these problems, 
i.e., we aim to show that both are in NP. This is straightforward for E, since 
membership for one-variable patterns is in P (cf. [Q). Thus, we may just guess a
pattern $\pi$ and check whether or not all strings from $\mathcal{P}$ are contained in $L(\pi)$ and
none of the strings from $\mathcal{N}$ is. Therefore, it is only natural to try
the same approach for CE. However, this immediately leads us to the following 
version of the membership problem. 

Definition 2. For any given $w \in (\Sigma \cup \{\star\})^+$ and $\pi \in Pat$, the existential
membership problem, denoted by $\exists Mem(\pi, w)$, is the problem of deciding whether
there exists a $\varphi \in \Phi_\Sigma$ such that $\varphi(w) \in L(\pi)$.

Lemma 1. $\exists Mem(\pi, w) \in P$.

Proof. Let $\pi = v_1 x v_2 x \cdots v_n x v_{n+1}$ be the input pattern, where $v_i \in \Sigma^*$ for all
$1 \le i \le n+1$. First, we check whether or not $|w| \ge |\pi|$ in $O(|\pi| + |w|)$ time.

If $|w| < |\pi|$, then $\varphi(w) \notin L(\pi)$ for all $\varphi \in \Phi_\Sigma$. If $|w| \ge |\pi|$, then we compute
$m = |w| - \sum_{1 \le i \le n+1} |v_i|$ and check whether or not $m/n$ is a positive integer.

This can be done in $O(|w|)$ time. If $m/n$ is not a positive integer, then there
exists no substitution $u$ such that $\pi[x/u] = w$. Now, let $k = m/n$ be a positive
integer. Then, $w$ is of the form $w = w_1 s_1 w_2 s_2 \cdots w_n s_n w_{n+1}$ such that for each
$1 \le i \le n+1$, $|w_i| = |v_i|$ and $|s_1| = \cdots = |s_n| = k$.

We can check whether or not there exist a $1 \le i \le n+1$ and a $1 \le j \le |v_i|$
such that $v_i[j] \ne w_i[j]$ and $w_i[j] \ne \star$ in $O(|w|)$ time. If there exist such $i$ and $j$, then
$\varphi(w) \notin L(\pi)$ for any $\varphi \in \Phi_\Sigma$. If not, for each $1 \le j \le k$, we next compute the
string $a_j = s_1[j] s_2[j] \cdots s_n[j]$. If an $a_j$ contains two different constants, then
there exists no $\varphi \in \Phi_\Sigma$ that maps all $s_1[j], \ldots, s_n[j]$ to the same constant. Thus,
$\varphi(w) \notin L(\pi)$ for any $\varphi \in \Phi_\Sigma$. Conversely, if for each $j$, the $a_j$ contains at most
one constant, then there exists a $\varphi \in \Phi_\Sigma$ such that $\varphi(w) \in L(\pi)$. The time to
check all the $a_j$ is $O(kn) = O(m) \le O(|w|)$. Hence, the time to decide whether
there exists $\varphi \in \Phi_\Sigma$ such that $\varphi(w) \in L(\pi)$ is $O(|\pi| + |w|)$. Q.E.D.
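
The procedure in the proof of Lemma 1 can be sketched in Python as follows. This is only an illustrative sketch under assumptions that are not part of the paper: the pattern is given as a string whose variable is the character `x`, the unsettled symbol is written `*`, and the function name `exists_mem` is hypothetical.

```python
def exists_mem(pattern: str, w: str, var: str = "x", star: str = "*") -> bool:
    """Decide whether some completion of w (stars replaced by constants)
    belongs to L(pattern). Encoding and names are assumptions of this sketch."""
    n = pattern.count(var)
    if n == 0:
        raise ValueError("a one-variable pattern must contain the variable")
    blocks = pattern.split(var)              # constant parts v_1, ..., v_{n+1}
    m = len(w) - sum(len(v) for v in blocks)
    if m <= 0 or m % n != 0:                 # no non-erasing substitution fits
        return False
    k = m // n                               # forced length of the substitution
    pos, subs = 0, []
    for i, v in enumerate(blocks):
        segment = w[pos:pos + len(v)]        # w_i, aligned with v_i
        # a settled symbol of w_i that differs from v_i can never be repaired
        if any(c != star and c != d for c, d in zip(segment, v)):
            return False
        pos += len(v)
        if i < n:
            subs.append(w[pos:pos + k])      # s_i, aligned with the i-th x
            pos += k
    for j in range(k):
        # column a_j = s_1[j] ... s_n[j] may hold at most one distinct constant
        constants = {s[j] for s in subs if s[j] != star}
        if len(constants) > 1:
            return False
    return True
```

For instance, `exists_mem("xbxcabaca", "*abb*cabaca")` holds because both unsettled symbols can be completed so that the two occurrences of `x` agree, whereas the fully settled string `aabbacabaca` is rejected.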

Consequently, Lemma 1 implies that, with respect to polynomial time,
incomplete strings do not make the membership problem for one-variable patterns
difficult. Moreover, Lemma 1 directly allows the following corollary.

Corollary 1. CE $\in$ NP.

Clearly, the next question arising naturally is whether CE is even in P
or NP-complete. While we must leave this problem open, our next result
sheds some additional light on the question of which problem is more difficult,
E or CE. For that purpose, we restrict CE so that the given sets of
positive examples $\mathcal{P}$ contain only constant strings, i.e., $\mathcal{P} \subseteq \Sigma^+$. The restricted
problem is denoted by RCE. The next theorem clarifies the relation between RCE
and E.

Theorem 1. E and RCE are equivalent with respect to polynomial-time
reductions.

Proof. Clearly, E is polynomial-time reducible to RCE. For the opposite direction,
suppose a finite set $\mathcal{P} \subseteq \Sigma^+$ and a finite set $\mathcal{N} \subseteq (\Sigma \cup \{\star\})^+$, where $\|\Sigma\| \ge 2$.
We want to construct sets $\mathcal{P}'$ and $\mathcal{N}'$ such that the answer to E, on input $\mathcal{P}'$
and $\mathcal{N}'$, is "yes" iff RCE is answered thus on input $\mathcal{P}$ and $\mathcal{N}$. We set $\mathcal{P}' = \mathcal{P}$ and
$\Sigma' = \Sigma \cup \{0, 1\}$, where $\Sigma \cap \{0, 1\} = \emptyset$. For each $w \in \mathcal{N}$, define $w'$ over $\Sigma'$ such
that for each $1 \le i \le |w|$, $w'[i] = w[i]$ if $w[i] \in \Sigma$, $w'[i] = 0$ if $w[j] \in \Sigma$ for all
$j < i$, and $w'[i] = 1$ otherwise. That is, if $w \in \Sigma^+$, then $w' = w$, and if $\sharp(w, \star) \ge 1$,
then $w'$ is obtained by replacing the first $\star$ in $w$ by 0 and by replacing all other
$\star$ in $w$ by 1. Finally, we define $\mathcal{N}'$ to be the set of all $w'$ obtained.

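Since $\mathcal{P}' = \mathcal{P} \subseteq \Sigma^+$, any pattern containing 0 or 1 is not consistent with $\mathcal{P}'$.
Thus, it is sufficient to show that there exist a $\pi \in (\Sigma \cup \{x\})^+$ and a $\varphi \in \Phi_\Sigma$ such
that $\varphi(\mathcal{N}) \cap L(\pi) = \emptyset$ iff there exists a $\pi' \in (\Sigma \cup \{x\})^+$ such that $\mathcal{N}' \cap L(\pi') = \emptyset$.
Moreover, since for any $w \in \mathcal{N}$, $w \in \Sigma^+$ iff $w \in \mathcal{N}'$, we can assume that each
string in $\mathcal{N}$ contains at least one $\star$. Thus, for each $w' \in \mathcal{N}'$, $\sharp(w', 0) = 1$.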

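First, assume that there are a $\varphi \in \Phi_\Sigma$ and a pattern $\pi \in (\Sigma \cup \{x\})^+$ such
that $\varphi(\mathcal{N}) \cap L(\pi) = \emptyset$.

Let $\sharp(\pi, x) \ge 2$. Since $\sharp(\pi, 0) = \sharp(\pi, 1) = 0$ and for each $w' \in \mathcal{N}'$, $\sharp(w', 0) = 1$,
there exists no substitution $u \in \Sigma'^+$ such that $\pi[x/u] \in \mathcal{N}'$. Thus, in this
case, $\mathcal{N}' \cap L(\pi) = \emptyset$.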

Let $\sharp(\pi, x) = 1$. Then $\pi$ is of the form $\pi = \alpha x \beta$, where $\alpha\beta \in \Sigma^+$. Since
$\varphi(\mathcal{N}) \cap L(\pi) = \emptyset$, for any $w \in \mathcal{N}$, there exist the following three cases.
Case 1: $|w| < |\pi|$, Case 2: There exists a prefix $\alpha'$ of $w$ such that $|\alpha'| = |\alpha|$
and $\alpha' \ne \alpha$, and Case 3: There exists a suffix $\beta'$ of $w$ such that $|\beta'| = |\beta|$ and
$\beta' \ne \beta$. Let the string $w \in \mathcal{N}$ be reduced to a $w' \in \mathcal{N}'$. In Case 1, since $|w'| = |w|$,
it is clear that $w' \notin L(\pi)$. In Case 2, either $\alpha' \in \Sigma^+$ or $\sharp(\alpha', \star) \ge 1$. If $\alpha' \in \Sigma^+$,
then it is also a prefix of $w'$. Thus, $\pi[x/u] \ne w'$ for any $u \in \Sigma'^+$. If $\sharp(\alpha', \star) \ge 1$,
then there exists a $1 \le i \le |\alpha'|$ such that $w'[i] \in \{0, 1\}$. Thus, $\pi[x/u] \ne w'$ for
any $u \in \Sigma'^+$. Case 3 is analogous. Hence, there exists a $\pi \in (\Sigma \cup \{x\})^+$ such
that $\mathcal{N}' \cap L(\pi) = \emptyset$.

Conversely, assume that there is a $\pi \in (\Sigma \cup \{x\})^+$ such that for any $w' \in
\mathcal{N}'$, $w' \notin L(\pi)$. Let the string $w' \in \mathcal{N}'$ be reduced from a $w \in \mathcal{N}$. Let $\pi =
v_1 x v_2 x \cdots v_n x v_{n+1}$, where $v_i \in \Sigma^*$ for each $1 \le i \le n+1$. Since $|w'| = |w|$,
if $m = (|w'| - (|\pi| - n))/n$ is not a positive integer, then $\varphi(w) \notin L(\pi)$ for
any $\varphi \in \Phi_\Sigma$. If $m$ is a positive integer, then $w = w_1 s_1 w_2 s_2 \cdots w_n s_n w_{n+1}$ and
$w' = w'_1 s'_1 w'_2 s'_2 \cdots w'_n s'_n w'_{n+1}$ such that for each $1 \le i \le n+1$, $|w_i| = |w'_i| = |v_i|$
and $|s_1| = \cdots = |s_n| = |s'_1| = \cdots = |s'_n| = m$.

If there exist a $1 \le i \le n+1$ and a $1 \le j \le |w'_i|$ such that $w'_i[j] \in \Sigma$ and
$w'_i[j] \ne v_i[j]$, then, since $w_i[j] = w'_i[j]$, for any $\varphi \in \Phi_\Sigma$, $\varphi(w) \notin L(\pi)$.

If there exist a $1 \le i \le n+1$ and a $1 \le j \le |w'_i|$ such that $w'_i[j] \in \{0, 1\}$,
then $w_i[j] = \star$. Thus, there exists a $\varphi \in \Phi_\Sigma$ that maps $w_i[j]$ to a
symbol not equal to $v_i[j] \in \Sigma$. It follows that $\varphi(w) \notin L(\pi)$.

Thus, we can assume that $v_i = w'_i = w_i$ for all $1 \le i \le n+1$. The remaining
parts are Case (a): $\sharp(\pi, x) = 1$ and Case (b): $\sharp(\pi, x) \ge 2$.

In Case (a), $w' = w'_1 s'_1 w'_2$. It follows that $w' \in L(\pi)$, a contradiction. In
Case (b), there exist $s'_i, s'_j \in \{s'_1, s'_2, \ldots, s'_n\}$ such that $i \ne j$ and $s'_i[k] = 0$,
where $1 \le k \le |s'_i|$. Since $s_i[k] = \star$, if $s_j[k] \in \Sigma$, then there exists a $\varphi \in \Phi_\Sigma$
that maps $s_i[k]$ to a symbol not equal to $s_j[k]$, and if $s_j[k] = \star$, then
there exists a $\varphi \in \Phi_\Sigma$ that maps $s_i[k]$ and $s_j[k]$ to different symbols.
Hence, there exist a $\pi \in (\Sigma \cup \{x\})^+$ and a $\varphi \in \Phi_\Sigma$ such that $\varphi(\mathcal{N}) \cap L(\pi) = \emptyset$.
Therefore, there exist a $\pi \in (\Sigma \cup \{x\})^+$ and a $\varphi \in \Phi_\Sigma$ such that $\varphi(\mathcal{N}) \cap L(\pi) = \emptyset$
iff there exists a $\pi' \in (\Sigma \cup \{x\})^+$ such that $\mathcal{N}' \cap L(\pi') = \emptyset$. Q.E.D.
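
The encoding of negative examples used in the proof of Theorem 1 is simple enough to write down. The following Python sketch assumes that the unsettled symbol is written `*` and that the characters `0` and `1` do not occur in the alphabet; the function names are hypothetical and only illustrate the construction.

```python
def encode_negative(w: str, star: str = "*") -> str:
    """Replace the first unsettled symbol of w by 0 and every later one by 1.
    Assumes '0' and '1' are not symbols of the original alphabet."""
    out = []
    seen_star = False
    for c in w:
        if c != star:
            out.append(c)          # settled symbols are copied unchanged
        elif not seen_star:
            out.append("0")        # first unsettled symbol
            seen_star = True
        else:
            out.append("1")        # all remaining unsettled symbols
    return "".join(out)


def reduce_rce_to_e(positives, negatives):
    """Map an RCE instance (constant positives, incomplete negatives)
    to an E instance over the extended alphabet."""
    return set(positives), {encode_negative(w) for w in negatives}
```

A constant string is left unchanged, so negative examples without unsettled symbols pass through the reduction as they are.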

From this Theorem, it seems that incomplete strings as negative examples
cause the difficulty of the consistency problem. However, it is open whether there
is a gap between RCE and CE. Moreover, E and RCE may belong to a
subclass of NP, e.g., randomized polynomial time or P.



4 Analyzing the Complexity of RE 

The aim of this section is to show the NP-completeness of problem RE. Again,
we begin with a version of membership adapted to RE as follows.

Definition 3. For any given $w \in (\Sigma \cup \{\star\})^+$ and $\pi \in Pat$, the universal
membership problem, denoted by $\forall Mem(\pi, w)$, is the problem of deciding whether for
each $\varphi \in \Phi_\Sigma$, $\varphi(w) \in L(\pi)$.
Lemma 2. $\forall Mem(\pi, w) \in P$.

Proof. Let $\pi = v_1 x v_2 x \cdots v_n x v_{n+1}$ be the input pattern, where $v_i \in \Sigma^*$ for all
$1 \le i \le n+1$. The test whether or not $|w| \ge |\pi|$ can be done in $O(|\pi| + |w|)$ time.
If $|w| \ge |\pi|$, then similarly to Lemma 1 we compute $m = |w| - \sum_{1 \le i \le n+1} |v_i|$
and check whether or not $m/n$ is a positive integer in $O(|w|)$ time. Let $m/n = k$
be a positive integer. Then, $w$ is of the form $w = w_1 s_1 w_2 s_2 \cdots w_n s_n w_{n+1}$ such
that for each $1 \le i \le n+1$, $|w_i| = |v_i|$ and $|s_1| = \cdots = |s_n| = k$.

We can check whether or not there exists a $1 \le i \le n+1$ such that $v_i \ne w_i$ in
$O(|w|)$ time. If there exist such $v_i$ and $w_i$, then $\varphi(w) \notin L(\pi)$ for a $\varphi \in \Phi_\Sigma$. If not,
for each $1 \le j \le k$, we next compute the string $a_j = s_1[j] s_2[j] \cdots s_n[j]$. Since we
have $v_i = w_i$, for each $\varphi \in \Phi_\Sigma$, $\varphi(w) \in L(\pi)$ iff for each $1 \le j \le k$, $s_1[j] = \cdots =
s_n[j]$. The time to check all the $a_j$ is $O(kn) = O(m) \le O(|w|)$. Hence, the time
to decide whether $\varphi(w) \in L(\pi)$ for all $\varphi \in \Phi_\Sigma$ is $O(|\pi| + |w|)$. Q.E.D.
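
A possible reading of the test in Lemma 2 is sketched below in Python. The sketch makes several assumptions beyond the text: the variable is the character `x`, the unsettled symbol is `*`, the alphabet has at least two symbols, and the column condition is read as requiring identical settled symbols whenever the variable occurs more than once (a column of unsettled symbols could otherwise be completed inconsistently). The name `forall_mem` is hypothetical.

```python
def forall_mem(pattern: str, w: str, var: str = "x", star: str = "*") -> bool:
    """Decide whether every completion of w belongs to L(pattern),
    assuming an alphabet with at least two symbols (sketch only)."""
    n = pattern.count(var)
    if n == 0:
        raise ValueError("a one-variable pattern must contain the variable")
    blocks = pattern.split(var)              # constant parts v_1, ..., v_{n+1}
    m = len(w) - sum(len(v) for v in blocks)
    if m <= 0 or m % n != 0:
        return False
    k = m // n
    pos, subs = 0, []
    for i, v in enumerate(blocks):
        # w_i must equal v_i exactly: an unsettled symbol here can be
        # completed to a symbol different from the one in v_i
        if w[pos:pos + len(v)] != v:
            return False
        pos += len(v)
        if i < n:
            subs.append(w[pos:pos + k])
            pos += k
    if n == 1:
        return True                          # a single x accepts any s_1
    for j in range(k):
        column = {s[j] for s in subs}
        # all occurrences of x must see the same settled symbol
        if len(column) != 1 or star in column:
            return False
    return True
```

With this reading, `forall_mem("axb", "a*b")` holds, in line with the robust extension example given earlier, while `forall_mem("xx", "**")` does not.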

We recall the problem of the robust extension RE for one-variable patterns in
Definition 1. This problem requires the universal consistency of a one-variable
pattern for every $\varphi \in \Phi_\Sigma$. Now, we give a log-space reduction from 3-SAT
to RE below.

Lemma 3. 3-SAT is log-space reducible to RE.

Proof. Let $C = C_1 \wedge C_2 \wedge \cdots \wedge C_m$ be a 3-CNF of $n$ variables $x_1, x_2, \ldots, x_n$ such
that $C_i = (\ell_{i_1} \vee \ell_{i_2} \vee \ell_{i_3})$, where each $\ell_{i_j}$ denotes a positive or negative literal
of $x_{i_j}$, that is, $\ell_{i_j} \in \{x_{i_j}, \neg x_{i_j}\}$, $1 \le i \le m$ and $1 \le j \le 3$. Let us fix an alphabet
consisting of $n+3$ symbols such that $\Sigma = \{a_1, a_2, \ldots, a_{n+1}\} \cup \{A, B\}$. First, we
compute the strings $\alpha_1, \alpha_2, \alpha_3$ and $\alpha_4$ as follows.

$\alpha_1 = a_1 A^2 a_2 B a_n A^2 a_{n+1}$,

$\alpha_2 = a_1 A a_2 A a_3 \cdots a_n A a_{n+1}$,

$\alpha_3 = a_1 A^2 a_2 A^2 a_3 \cdots a_n A^2 a_{n+1}$ and

$\alpha_4 = a_1 A^3 a_2 A^3 a_3 \cdots a_n A^3 a_{n+1}$.

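Next, for each clause $C_i = (\ell_{i_1} \vee \ell_{i_2} \vee \ell_{i_3})$, we compute the string $\beta_i =
a_1 \gamma_1 a_2 \gamma_2 a_3 \cdots a_n \gamma_n a_{n+1}$ such that for each $1 \le j \le n$, $\gamma_j = BA$ if $x_j \in
\{\ell_{i_1}, \ell_{i_2}, \ell_{i_3}\}$, $\gamma_j = AB$ if $\neg x_j \in \{\ell_{i_1}, \ell_{i_2}, \ell_{i_3}\}$ and $\gamma_j = \star\star$ otherwise. Finally, we
output $\mathcal{P} = \{\alpha_3, \alpha_4\}$ as the set of positive examples and $\mathcal{N} = \{\alpha_1, \alpha_2\} \cup \{\beta_k \mid
1 \le k \le m\}$ as the set of negative examples. The idea for the string $\beta_k \in \mathcal{N}$ is
illustrated in Fig. 1. We first prove the following Claim.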

Claim: If a pattern $\pi \in Pat$ is consistent with the strings $\alpha_1, \alpha_2, \alpha_3$, and $\alpha_4$,
then $\pi$ is of the form $\pi = a_1 X_1 a_2 X_2 a_3 \cdots a_n X_n a_{n+1}$, where $X_i \in \{Ax, xA\}$
and $1 \le i \le n$.

Suppose that there exists a pattern $\pi \in Pat$ consistent with $\alpha_1, \alpha_2, \alpha_3$,
and $\alpha_4$. There are the following cases for a substitution $u \in \Sigma^+$ such that
$\pi[x/u] \in \{\alpha_1, \alpha_2, \alpha_3, \alpha_4\}$.

Case 1: A substitution $u$ contains an $a_i \in \{a_1, a_2, \ldots, a_{n+1}\}$ such that
$\pi[x/u] \in \mathcal{P}$. Since any $a_i$ appears in the strings at most once, $\pi$ must be of the
form $wxw'$ such that $w \in \{a_1, a_1A, a_1A^2\}$ and $w' \in \{a_{n+1}, Aa_{n+1}, A^2a_{n+1}\}$.
For any possible $w$ and $w'$, $\pi$ is not consistent with the negative example $\alpha_1$.
Thus, Case 1 is removed.

[Figure: for $C = C_1 \wedge C_2 \wedge C_3 \wedge C_4$ with $C_3 = (x_i \vee \neg x_j \vee x_k)$, the negative example
$\beta_3 = a_1 \star\star\, a_2 \star\star \cdots a_i BA a_{i+1} \cdots a_j AB a_{j+1} \cdots a_k BA a_{k+1} \cdots$ is shown among $\beta_1, \ldots, \beta_4$.]

Fig. 1. Negative examples computed from a 3-CNF C of n variables.

Case 2: $\pi[x/AA] = \alpha_3$. Then, $\pi = a_1 X_1 a_2 X_2 a_3 \cdots a_n X_n a_{n+1}$, where $X_i \in
\{AA, x\}$ and $1 \le i \le n$. If $X_j = AA$ for a $1 \le j \le n$, then $\pi$ contains
$a_j AA a_{j+1}$. Since $\sharp(\alpha_4, a_j AA a_{j+1}) = 0$, it is a contradiction. Thus, in this case,
$\pi$ is of the form $a_1 x a_2 x a_3 \cdots a_n x a_{n+1}$. This pattern is inconsistent with the
negative example $\alpha_2$.

Case 3: $\pi[x/AA] = \alpha_4$. Then $\pi$ contains one of the strings $a_i A^3 a_{i+1}$,
$a_i Ax a_{i+1}$ and $a_i xA a_{i+1}$. In case of $a_i A^3 a_{i+1}$, $\pi$ is not consistent with
$\alpha_3$. It is a contradiction. Thus, for each $1 \le i \le n$, the substring of $\pi$ between $a_i$
and $a_{i+1}$ must be $Ax$ or $xA$. This $\pi$ satisfies the Claim and it is consistent with
all $\alpha_1, \alpha_2, \alpha_3$, and $\alpha_4$.

Case 4: $\pi[x/A] = \alpha_3$. If $\pi$ contains a substring $a_i AA a_{i+1}$, then it is not
consistent with $\alpha_4$. Thus, $\pi$ satisfies the Claim.

Case 5: $\pi[x/A] = \alpha_4$. In this case, $\pi$ contains one of $a_i x A^2 a_{i+1}$, $a_i AxA a_{i+1}$
and $a_i A^2 x a_{i+1}$. Since it is not consistent with $\alpha_3$, this case is removed. Thus,
the Claim is true.

We set $PI = \{\pi \in Pat \mid \alpha_3, \alpha_4 \in L(\pi), \alpha_1, \alpha_2 \notin L(\pi)\}$. By the above Claim,
the proof of the lemma is now reduced to showing that the 3-CNF $C$ is
satisfiable iff there exists a $\pi \in PI$ such that $\varphi(\mathcal{N}) \cap L(\pi) = \emptyset$ for all $\varphi \in \Phi_\Sigma$.

Assume that the CNF $C$ is satisfiable. There exists a truth assignment $f :
\{x_1, x_2, \ldots, x_n\} \to \{0, 1\}$ such that for each clause $C_i = (\ell_{i_1} \vee \ell_{i_2} \vee \ell_{i_3})$ ($1 \le i \le
m$), at least one $\ell' \in \{\ell_{i_1}, \ell_{i_2}, \ell_{i_3}\}$ fulfills exactly one of the following conditions.

Case (a): $\ell' = x_j \in \{x_1, \ldots, x_n\}$ and $f(x_j) = 1$, or
Case (b): $\ell' = \neg x_j$ and $f(x_j) = 0$.
There exists a one-to-one mapping from truth assignments $f$ for $x_1, \ldots, x_n$
to the set $PI$ such that for each $1 \le i \le n$, $f(x_i) = 0$ iff $\pi$ contains $a_i xA a_{i+1}$
and $f(x_i) = 1$ iff $\pi$ contains $a_i Ax a_{i+1}$.

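If an $f$ satisfies each clause $C_i$, then there exists a negative example $\beta_i \in \mathcal{N}$
constructed from $C_i$ such that $x_j \in \{\ell_{i_1}, \ell_{i_2}, \ell_{i_3}\}$ iff $\beta_i$ contains $a_j BA a_{j+1}$
and $\neg x_j \in \{\ell_{i_1}, \ell_{i_2}, \ell_{i_3}\}$ iff $\beta_i$ contains $a_j AB a_{j+1}$.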

In Case (a), a $\pi \in PI$ corresponding to $f$ contains $a_j Ax a_{j+1}$ and the
$\beta_i$ constructed from $C_i$ contains $a_j BA a_{j+1}$. We note that for any $\pi \in PI$ and
$\beta_i \in \mathcal{N}$, $|\pi| = |\beta_i|$. Thus, comparing the strings $a_j Ax a_{j+1}$ and $a_j BA a_{j+1}$,
no substitution $u \in \Sigma^+$ and no $\varphi \in \Phi_\Sigma$ satisfy $\pi[x/u] = \varphi(\beta_i)$.

In Case (b), analogously a $\pi$ corresponding to $f$ contains $a_j xA a_{j+1}$ and
the $\beta_i$ constructed from $C_i$ contains $a_j AB a_{j+1}$. Thus, there is no substitution $u$
and no $\varphi$ for which $\pi[x/u] = \varphi(\beta_i)$.

Hence, if $C$ is satisfiable, then there exists a $\pi \in PI$ such that $\varphi(\mathcal{N}) \cap L(\pi) = \emptyset$
for all $\varphi \in \Phi_\Sigma$.

Conversely, let the CNF $C$ be unsatisfiable. Then, for any truth assignment
$f$, there exists a clause $C_i$ of $C$ such that for any literal $\ell'$ of $C_i$, either $\ell' = x_j \in
\{x_1, \ldots, x_n\}$ and $f(x_j) = 0$ or $\ell' = \neg x_j$ and $f(x_j) = 1$.

Then, for each $\ell' \in \{\ell_{i_1}, \ell_{i_2}, \ell_{i_3}\}$, $\pi$ contains $a_j xA a_{j+1}$ and the $\beta_i \in \mathcal{N}$
contains $a_j BA a_{j+1}$ if $\ell' = x_j \in \{x_1, \ldots, x_n\}$, and $\pi$ contains $a_j Ax a_{j+1}$ and
the $\beta_i$ contains $a_j AB a_{j+1}$ if $\ell' = \neg x_j$.

Finally, we define $\varphi(\beta_i) = w$ such that for each $1 \le k \le |\beta_i|$,

$w[k] = B$ if $\pi[k] = x$, and $w[k] = \pi[k]$ otherwise.

It follows that $\varphi(\beta_i) \in L(\pi)$ for the substitution $B$. Hence, if $C$ is unsatisfiable,
then for any $\pi \in PI$, there exists a $\varphi \in \Phi_\Sigma$ such that $\varphi(\mathcal{N}) \cap L(\pi) \ne \emptyset$. Thus,
we conclude that the 3-CNF $C$ is satisfiable iff there exists a robust extension
for $\mathcal{P}$ and $\mathcal{N}$. Q.E.D.



Theorem 2. RE is NP-complete, even if $\|\Sigma\| = 2$.

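Proof. Analogously, we reduce a 3-CNF $C$ to RE. Let $C$ consist of $m$ clauses
$C_1, C_2, \ldots, C_m$ and be defined by $n$ variables $x_1, x_2, \ldots, x_n$. Initially, we set $\Sigma =
\{A, B\}$ and $\mathcal{P} = \{\alpha_1 = A^{2n}, \alpha_2 = A^{3n}\}$. The set $\mathcal{N}$ of negative examples over
$\Sigma$ is the union of the sets $N_1$, $N_2$ and $N_3$ defined as follows.

$N_1 = \{A^n\} \cup \{\beta_i = A^{2n+i} \mid i \in \{1, \ldots, 2n\} \setminus \{n, 2n\}\}$,

$N_2 = \{\beta_j = \star^{2(j-1)} BB \star^{2(n-j)} \mid 1 \le j \le n\}$ and

$N_3 = \{\beta_k = \gamma_{k_1} \cdots \gamma_{k_n} \mid 1 \le k \le m\}$ such that, for all $1 \le k \le m$ and $1 \le j \le n$,

$\gamma_{k_j} = BA$ if $x_j$ is a literal of $C_k$,
$\gamma_{k_j} = AB$ if $\neg x_j$ is a literal of $C_k$, and
$\gamma_{k_j} = \star\star$ otherwise.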
We define the set $PI \subseteq Pat$ such that $\pi \in PI$ iff $\pi = X_1 X_2 \cdots X_n$, where
$X_i \in \{Ax, xA\}$ and $1 \le i \le n$. Then, we prove the following Claim.

Claim: $\pi \in PI$ iff $\pi \in Pat$ is consistent with $\mathcal{P}$ and $N_1 \cup \varphi(N_2)$ for all $\varphi \in \Phi_\Sigma$.

First, assume $\pi \in PI$. Then, clearly, $\mathcal{P} \subseteq L(\pi)$. For each $\beta_i \in N_1$,
$\pi[x/u] \ne \beta_i$ because $i \in \{1, \ldots, 2n\} \setminus \{n, 2n\}$ and $\sharp(\pi, A) = \sharp(\pi, x) = n$. Thus,
$N_1 \cap L(\pi) = \emptyset$. For each odd number $i$ such that $1 \le i \le 2n - 1$, $\pi[i]\pi[i+1] = Ax$
or $\pi[i]\pi[i+1] = xA$. On the other hand, for any $\beta_j \in N_2$, there exists exactly
one odd number $i$ such that $\beta_j[i]\beta_j[i+1] = BB$. Since $|\pi| = |\beta_j|$, there exists
no $\varphi \in \Phi_\Sigma$ such that $\varphi(\beta_j) \in L(\pi)$. Thus, if $\pi \in PI$, then $\pi$ is consistent with
$\mathcal{P}$ and $N_1 \cup \varphi(N_2)$ for all $\varphi \in \Phi_\Sigma$.

Suppose to the contrary that there exists a $\pi \in Pat$ with $\pi \notin PI$ such that
$\pi$ is consistent with $\mathcal{P}$ and $N_1 \cup \varphi(N_2)$ for all $\varphi \in \Phi_\Sigma$. If $\pi$ contains a $B$, then
clearly, $\mathcal{P} \cap L(\pi) = \emptyset$. Thus, we can assume $\pi \in \{A, x\}^+$.

We first show that $\pi$ must satisfy $\pi[x/A] = \alpha_1 = A^{2n}$. Assume that $\pi[x/u] = A^{2n}$
and $u = A^r$, where $2 \le r \le 2n$. Then, $\sharp(\pi, x) = n' \le n$ and $\sharp(\pi, A) =
2n - rn'$. If $n' = n$, then $\pi = x^n$. Since $x^n[x/A] = A^n \in N_1$, it is a contradiction.
Thus, $n' \le n - 1$. For each $1 \le n' \le n - 1$, there exists a $\beta_{n'} \in N_1$ such that
$\beta_{n'} = A^{2n+n'} = A^{2n-rn'} A^{(r+1)n'}$. It follows the contradiction $\pi[x/A^{r+1}] = \beta_{n'}$.

Thus, $\pi[x/A] = A^{2n}$.

Second, we show that $\pi$ must satisfy $\sharp(\pi, x) = n$. Assume that $\sharp(\pi, x) =
n' \ne n$. In case of $n' \le n - 1$, there exists $\beta_{n'} \in N_1$ such that $\beta_{n'} = A^{2n+n'} =
A^{2n-n'} A^{2n'}$. Thus, $\beta_{n'} \in L(\pi)$ for the substitution $A^2$. In case of $\sharp(\pi, x) =
n' \ge n + 1$, since $\alpha_1, \alpha_2 \in L(\pi)$, it holds that $n' \le 2n - 1$. Thus, there exists
$\beta_{n'} \in N_1$ such that $\beta_{n'} = A^{2n+n'} = A^{2n-n'} A^{2n'}$. It follows the contradiction
$\pi[x/A^2] = \beta_{n'}$. Thus, $\sharp(\pi, x) = n$. By the condition that $\pi[x/A] = A^{2n}$ and
$\sharp(\pi, x) = n$, $\pi$ must satisfy $\sharp(\pi, A) = \sharp(\pi, x) = n$.

Since $\pi \in \{A, x\}^+$, $\pi \notin PI$ and $\sharp(\pi, A) = \sharp(\pi, x) = n$, there exists an odd
number $i$ with $1 \le i \le 2n - 1$ such that $\pi[i]\pi[i+1] = xx$. On the other hand,
since $(i+1)/2$ is a positive integer, there is the string $\beta_{(i+1)/2} \in N_2$, which
is of the form $\star^{i-1} BB \star^{2n-i-1}$. Thus, there exists a $\varphi \in \Phi_\Sigma$ such that
for each $1 \le k \le 2n$, $\varphi(\beta_{(i+1)/2})[k] = B$ if $\pi[k] = x$ and $\varphi(\beta_{(i+1)/2})[k] = A$
otherwise. It follows the contradiction $\pi[x/B] = \varphi(\beta_{(i+1)/2})$. Hence, if a $\pi \in Pat$
is consistent with $\mathcal{P}$ and $N_1 \cup \varphi(N_2)$ for all $\varphi \in \Phi_\Sigma$, then $\pi = X_1 X_2 \cdots X_n$ such
that $X_i \in \{Ax, xA\}$ and $1 \le i \le n$, and the Claim is true.

Similarly to the proof of Lemma 3, we can see that the 3-CNF is satisfiable
iff there exists a $\pi \in PI$ consistent with $\varphi(N_3)$ for all $\varphi \in \Phi_\Sigma$. Thus, we conclude
that RE is NP-complete, even if $\|\Sigma\| = 2$. Q.E.D.
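
The instance built in the proof of Theorem 2 can be generated mechanically. The Python sketch below is illustrative only: clauses are assumed to be lists of signed integers (`+j` for $x_j$, `-j` for $\neg x_j$), the unsettled symbol is written `*`, the shape of $N_2$ is read off from Example 2, and the function name is hypothetical. A clause containing both a variable and its negation is not treated specially.

```python
def build_re_instance(n, clauses):
    """Return (P, N1, N2, N3) for a 3-CNF over x_1..x_n.
    clauses: list of lists of nonzero ints, +j for x_j and -j for not x_j."""
    positives = {"A" * (2 * n), "A" * (3 * n)}
    # N1: A^n together with A^(2n+i) for i in {1..2n} minus {n, 2n}
    n1 = {"A" * n} | {
        "A" * (2 * n + i) for i in range(1, 2 * n + 1) if i not in (n, 2 * n)
    }
    # N2: the j-th string has BB in the j-th pair of positions, stars elsewhere
    n2 = ["*" * (2 * (j - 1)) + "BB" + "*" * (2 * (n - j)) for j in range(1, n + 1)]
    # N3: one string per clause, two symbols per variable
    n3 = []
    for clause in clauses:
        literals = set(clause)
        gammas = []
        for j in range(1, n + 1):
            if j in literals:
                gammas.append("BA")      # x_j occurs positively
            elif -j in literals:
                gammas.append("AB")      # x_j occurs negated
            else:
                gammas.append("**")      # x_j does not occur
        n3.append("".join(gammas))
    return positives, n1, n2, n3
```

For `n = 5` the sets `N1` and `N2` produced this way coincide with those listed in Example 2.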

Example 2. Let $C$ be a 3-CNF of 5 variables. Then, $\mathcal{P} = \{A^{10}, A^{15}\}$. The $N_1$
and $N_2$ in Theorem 2 are defined as follows.

$N_1 = \{A^5, A^{11}, A^{12}, \ldots, A^{14}, A^{16}, \ldots, A^{19}\}$ and
$N_2 = \{BB\star^8, \star^2 BB \star^6, \star^4 BB \star^4, \star^6 BB \star^2, \star^8 BB\}$.
Let us take a $\pi \in Pat$ such that $\pi[x/u] = A^{10}$ for a substitution $u \in \Sigma^+$.
If $u \ne A$, then $u = A^n$ for some $2 \le n \le 10$. Let $\sharp(\pi, x) = m$, where $1 \le m \le 5$.
If $m = 5$, then $\pi = x^5$ and $A^5 \in L(\pi)$. Thus, this $\pi$ is inconsistent with $N_1$. For
each $2 \le n \le 10$ and $1 \le m \le 4$ such that $\pi[x/A^n] = A^{10}$ and $\sharp(\pi, x) = m$, there
exists a $w \in N_1$ such that $\pi[x/A^{n+1}] = w$. Thus, if a $\pi$ is consistent with $N_1$,
then $\pi[x/A] = A^{10}$.

If $\sharp(\pi, x) < 5$, then there exists $1 \le n' \le 4$ such that $A^{10+n'} \in N_1 \cap L(\pi)$,
and if $\sharp(\pi, x) > 5$, then there exists $6 \le n' \le 9$ such that $A^{10+n'} \in N_1 \cap L(\pi)$.
Thus, these $\pi$ are inconsistent with $N_1$. It follows that if a $\pi$ is consistent with
$N_1$, then $\sharp(\pi, x) = 5$. Hence, $\sharp(\pi, x) = \sharp(\pi, A) = 5$.

For example, let $\pi = xAxxAAxAAx \notin PI$. Then, there exist a $\varphi \in \Phi_\Sigma$
and $w = \star^2 BB \star^6 \in N_2$ such that $\varphi(w) = BABBAABAAB \in L(\pi)$. Let $\pi =
xAxAAxAxxA \in PI$. Then, there exist no $w \in N_2$ and $\varphi \in \Phi_\Sigma$ such that
$\varphi(w) \in L(\pi)$.
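
The two patterns at the end of Example 2 can be checked by a few lines of Python. This is a sketch under assumptions of our own: the variable is the character `x`, the unsettled symbol is `*`, and the helper name is hypothetical. It relies on the observation that when the pattern and the incomplete string have the same length, a non-erasing substitution must be a single symbol.

```python
def some_completion_matches(pattern, w, alphabet="AB", var="x", star="*"):
    """For |pattern| == |w|: is there a completion of w that equals
    pattern[x/c] for a single symbol c of the alphabet?"""
    if len(pattern) != len(w):
        raise ValueError("this shortcut only covers strings of equal length")
    for c in alphabet:
        image = pattern.replace(var, c)
        # every settled symbol of w must agree with the image
        if all(b == star or b == a for a, b in zip(image, w)):
            return True
    return False


n = 5
N2 = ["*" * (2 * (j - 1)) + "BB" + "*" * (2 * (n - j)) for j in range(1, n + 1)]

# pattern outside PI: some string of N2 can be completed into its language
hits_outside = [w for w in N2 if some_completion_matches("xAxxAAxAAx", w)]
# pattern inside PI: no string of N2 can be completed into its language
hits_inside = [w for w in N2 if some_completion_matches("xAxAAxAxxA", w)]
```

Here `hits_outside` contains exactly the string with `BB` in positions 3 and 4, and `hits_inside` is empty.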

5 Conclusions 

We have studied the decision problems E, CE and RE as well as the membership
problems $\exists Mem(\pi, w)$ and $\forall Mem(\pi, w)$ for one-variable patterns with respect to
incomplete strings. Although the problem RE is shown to be NP-complete even
if the size of the alphabet is 2, it is still an open question whether the problem
CE is in a subclass of NP. Also, the tractability of problem E remains open.

In the first part of this paper, we showed the equivalence of E and RCE
for polynomial-time reductions. This means that incomplete negative examples
make problem RCE no more difficult than problem E. Then, our next interest is
whether there is a computational gap between CE and E. Since $\exists Mem(\pi, w) \in P$,
we can check whether a one-variable pattern is consistent with given positive and
negative examples in polynomial time for both problems. The critical part to be
overcome is an effective representation of all consistent one-variable patterns for
given incomplete strings.

For constant strings, Angluin introduced the one-variable pattern automata.
They can be thought of as a clever data structure to represent exponentially
many consistent patterns in polynomial space. Let $L(A)$ denote the set of
one-variable patterns accepted by $A$. Then, for any given one-variable pattern
automata $A$ and $B$, we can compute a one-variable pattern automaton $A'$ such
that $L(A') = L(A) \cap L(B)$ in time polynomial in the size of $A$ and $B$.
Thus, we can find a descriptive pattern for a given set of strings by enumerating
at most polynomially many patterns. Hence, for attacking CE and E further, it
may be promising to generalize Angluin's one-variable pattern automata to
our incomplete strings.
Abstract. We consider a data mining problem in a large collection of
unstructured texts based on association rules over subwords of texts. A
two-words association pattern is an expression such as

(TATA, 30, AGGAGGT) $\Rightarrow$ C

that expresses a rule that if a text contains a subword TATA followed by
another subword AGGAGGT with distance no more than 30 letters then
a property C will hold with high probability. The optimized confidence
pattern problem is to compute frequent patterns $(\alpha, k, \beta)$ that optimize
the confidence with respect to a given collection of texts. Although this
problem is solved in polynomial time by a straightforward algorithm that
enumerates all the possible patterns in time $O(n^5)$, we focus on the
development of more efficient algorithms that can be applied to large text
databases. We present an algorithm that solves the optimized confidence
pattern problem in time $O(\max\{k, m\} n^2)$ and space $O(kn)$, where $m$
and $n$ are the number and the total length of classification examples,
respectively, and $k$ is a small constant around 30 to 50. This algorithm
combines the suffix tree data structure in combinatorial string matching
and the orthogonal range query technique in computational geometry for
fast computation. Furthermore, for most random texts like DNA sequences,
we show that a modification of the algorithm runs very efficiently in
time $O(kn \log^3 n)$ and space $O(kn)$. We also discuss some heuristics such
as sampling and pruning as practical improvements. Then, we evaluate
the efficiency and the performance of the algorithm with experiments on
genetic sequences. A relationship with efficient Agnostic PAC-learning is
also discussed.



1 Introduction 

The recent progress of communication and network technologies, e.g., electronic
mail, World Wide Web, and inter/intra networks, makes it easy for computer
users to accumulate a large collection of unstructured or semi-structured texts
on their computers at a low cost. Such text databases may be collections of
web pages or SGML documents (OPENTEXT Index), protein databases in
molecular biology (GenBank), online dictionaries (OED), or plain texts
on a file system. There has been a potential demand for efficient discovery of
useful information from text databases beyond the power of the present access
methods in information retrieval.

* Presently working for Fujitsu LTD.

Data mining is a research area that aims at the development of semi-automatic
tools for discovering useful information from large databases. Data mining
emerged in the early 1990s and has been extensively studied in both practice and
theory. However, the present data mining technologies mainly deal with
well-structured data such as relational databases with Boolean or numeric
attributes, and thus are not directly applicable to unstructured text
data, because a text database is simply a collection of unstructured strings and
the amount of the data is huge, typically ranging from mega ($10^6$) bytes
to tera ($10^{12}$) bytes. We concentrate on efficient and robust discovery methods
that work for a large collection of unstructured texts.

In this paper, we consider the discovery of very simple patterns called
k-proximity two-words association patterns. Given a collection $S$ of texts and an
objective condition $C$ over $S$, a k-proximity two-words association pattern is an
expression of the form

(TATA, 30, AGGAGGT) $\Rightarrow$ C

that represents a rule that if a text contains a subword TATA followed by another
subword AGGAGGT with distance no more than $k = 30$ letters then the objective
condition $C$ will hold with some probability. Although this class of rules seems very
restricted, such rules are more flexible in describing a local similarity in text data with
context information than unordered collections of keywords or single subwords.
Hence, rules of this kind are frequently used in applications in bioinformatics,
bibliographic search, and Web search. Further, the simplicity of the
class allows robust and efficient learning in noisy environments, as we will show
in this paper.

As the framework of data mining, we adopt the optimal pattern discovery
recently proposed by Fukuda et al. In their framework, a discovery algorithm
receives a sample set $S$ with an objective condition $C : S \to \{0, 1\}$ and finds
all or some patterns $P$ that maximize a certain criterion. Based on this framework,
we consider the efficient solvability of the optimized confidence pattern problem
in data mining (Fukuda et al.) and the empirical error minimization problem
in computational learning theory (Kearns, Schapire, Sellie; Maass).

The optimized confidence patterns can be computed in time $O(n^5)$ by a
straightforward algorithm that enumerates $O(n^4)$ possible two-words association
patterns, since there are at most $O(n^2)$ subwords of $A$. However,
this polynomial is too large to apply this algorithm to real applications. To address this
problem, in Section 3 and in Section 4, we present variations of algorithms that
compute all the two-words association patterns $(\alpha, k, \beta)$ using data structures
from string matching and computational geometry, the suffix tree and the
orthogonal range query. The idea is to reduce the discovery of such patterns to that
of axes-parallel rectangles over the 2-dimensional plane of suffix ranks.
In Section 3, we present an algorithm that runs in time $O(mn^2 \log^2 n)$ and
in space $O(kmn \log n)$, where $m$ and $n$ are the number and the total size of
texts, respectively, and $k$ is a proximity. Next in Section 4, implementing the
orthogonal range queries directly over the suffix tree, we give a modified version
of the algorithm that runs in time $O(\max\{k, m\} n^2)$ and space $O(kn)$ in the
worst case. For most nearly random texts like DNA sequences, it is known
that the depth of the suffix tree is bounded by $O(\log n)$. Then, we show that
our algorithm can be modified to run very quickly in time $O(kn \log^3 n)$ for such
random texts.

In Section 5, we describe an application to computational learning theory.
We show that the algorithm in Section 4 can be modified to efficiently solve
the empirical error minimization problem. As a corollary, we obtain an efficient
Agnostic PAC-learner for the class of k-proximity two-words association
patterns. Nakanishi et al. study the consistency problem, which is a special
case of the empirical error minimization problem, for the class of single
substrings. Since a single substring is a special k-proximity two-words association
pattern, our result also generalizes their result.

In Section 6, we introduce some heuristics and examine their performance.
In Section 7, we evaluate the efficiency and the performance of our algorithm
from the empirical viewpoint with experiments on genetic sequences from
GenBank databases. Finally, Section 8 concludes.



2 Preliminaries 

2.1 Texts and Patterns 

Let $\Sigma$ be a finite alphabet. We always assume a fixed total order on the letters in
$\Sigma$. For a string $s$ and a set $S$ of strings, we denote by $|s|$ and by $size(S)$ the length
of $s$ and the total length of the strings in $S$, respectively. If there exist some $u, v, w \in \Sigma^*$ such
that $t = uvw$ then we say that $u$, $v$ and $w$ are a prefix, a subword and a suffix
of $t$, respectively. A text is any string $t$ over $\Sigma$. For $1 \le i \le |t|$, we denote the
$i$-th letter of $t$ by $t[i]$. An occurrence of $s \in \Sigma^*$ in $t$ is a positive integer $i$ such
that $t[i] \cdots t[i + |s| - 1] = s$.

Let $k$ be a nonnegative integer. A k-proximity two-words association pattern
(or pattern, for short) is a triple $P = (\alpha, k, \beta)$, where $\alpha, \beta \in \Sigma^*$ are strings over
$\Sigma$ and $k \ge 0$ is called a proximity. A proximity two-words association pattern $P$
matches a string $t \in \Sigma^*$ if there exists a pair of integers $p$ and $q$ such that $p$ and
$q$ are occurrences of $\alpha$ and $\beta$ in $t$, respectively, and satisfy $0 \le q - p \le k$.
The pair $(p, q)$ is called an occurrence of $P$ in $t$. For a set $S$, $card(S)$ denotes the
number of elements in $S$. For nonnegative integers $i, j$, $[i..j]$ denotes the interval
$\{i, i+1, \ldots, j\}$ if $i \le j$ and $\emptyset$ otherwise.
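
The matching relation just defined can be tested naively. The following Python sketch is not part of the paper's algorithms; it only restates the definition, with patterns represented as tuples `(alpha, k, beta)` and with hypothetical function names. The distance is measured between the starting positions of the two subwords, as in the definition.

```python
def occurrences(s: str, t: str) -> list:
    """1-based starting positions of s in t (overlapping occurrences included)."""
    return [i + 1 for i in range(len(t) - len(s) + 1) if t.startswith(s, i)]


def matches(pattern, t: str) -> bool:
    """True iff there are occurrences p of alpha and q of beta in t
    with 0 <= q - p <= k."""
    alpha, k, beta = pattern
    ps = occurrences(alpha, t)
    qs = occurrences(beta, t)
    return any(0 <= q - p <= k for p in ps for q in qs)
```

For example, in the text `GGTATACCAGGAGGTCC` the subword `TATA` occurs at position 3 and `AGGAGGT` at position 9, so the pattern `("TATA", 30, "AGGAGGT")` matches while `("TATA", 5, "AGGAGGT")` does not.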
2.2 Data Mining Problem

A sample is a finite set $S = \{s_1, \ldots, s_m\}$ of strings over $\Sigma$ and an objective
condition over $S$ is a binary labeling function $C : S \to \{0, 1\}$. Each string $s_i \in S$
is a document and $C(s_i)$ is its label. A document $s_i$ is positive if $C(s_i) = 1$
and negative otherwise. Let $\alpha \in \{0, 1\}$ be any label. We denote the set $\{s \in
S \mid C(s) = \alpha\}$ by $S_\alpha$. For a subset $T$ of $S$, we define $count(P, T)$ to be the number
of documents $s \in T$ that $P$ matches.

The first problem to consider is the confidence maximization. For a
pattern $P$, we define the support of $P$ by $supp_S(P) = count(P, S_1) / card(S)$
and the confidence of $P$ by $conf_S(P) = count(P, S_1) / count(P, S)$. A minimum
support is any real number $0 \le \sigma \le 1$. A pattern $P$ is said to be $\sigma$-frequent if
$supp_S(P) \ge \sigma$.

Definition 1. Optimized Confidence Pattern Problem
An instance is a five-tuple $(\Sigma, S, C, k, \sigma)$ of an alphabet $\Sigma$, a sample set $S$, an
objective condition $C$ over $S$, and constants $k \ge 0$ and $0 \le \sigma \le 1$. The problem is
to find a $\sigma$-frequent k-proximity two-words association pattern $P$ that maximizes
$conf_S(P)$.

Intuitively, the optimized confidence pattern problem is to find an implication
$P \Rightarrow C$ with highest conditional probability among those rules that can apply
to at least a fraction $\sigma$ of the documents in $S$. In Section 3 and Section 4, we give
algorithms for efficiently solving the optimized confidence pattern problem.
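
To make the definitions of support and confidence concrete, here is a brute-force Python sketch that enumerates all pairs of subwords. It is not one of the efficient algorithms of Sections 3 and 4, only a restatement of Definition 1. The sample is assumed to be a list of `(text, label)` pairs, the confidence of a pattern that matches no document is taken to be 0, and all function names are hypothetical.

```python
def occurrences(s, t):
    """1-based starting positions of s in t."""
    return [i + 1 for i in range(len(t) - len(s) + 1) if t.startswith(s, i)]


def matches(pattern, t):
    alpha, k, beta = pattern
    return any(0 <= q - p <= k
               for p in occurrences(alpha, t) for q in occurrences(beta, t))


def count(pattern, docs):
    """Number of documents in docs matched by the pattern."""
    return sum(1 for s in docs if matches(pattern, s))


def support(pattern, sample):
    positives = [s for s, label in sample if label == 1]
    return count(pattern, positives) / len(sample)


def confidence(pattern, sample):
    positives = [s for s, label in sample if label == 1]
    total = count(pattern, [s for s, _ in sample])
    return count(pattern, positives) / total if total else 0.0  # assumption: 0 if no match


def subwords(sample):
    """All nonempty subwords of all texts in the sample."""
    return {s[i:j] for s, _ in sample
            for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def best_pattern(sample, k, sigma):
    """Exhaustively search for a sigma-frequent pattern of maximum confidence."""
    words = sorted(subwords(sample))
    best, best_conf = None, -1.0
    for alpha in words:
        for beta in words:
            p = (alpha, k, beta)
            if support(p, sample) >= sigma:
                c = confidence(p, sample)
                if c > best_conf:
                    best, best_conf = p, c
    return best, best_conf
```

The exhaustive search is only practical for toy inputs, which is precisely the motivation for the suffix-tree-based algorithms that follow.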

The second problem to consider is the empirical error minimization.
Let $S$ be a sample, $C$ an objective condition over $S$ and $P$ a k-proximity
two-words association pattern over $\Sigma$. We define $P(s) \in \{0, 1\}$ to be 1 iff $P$ occurs in $s$.
The empirical error of a pattern $P$ with respect to $S$ and $C$ is the number of the
documents in $S$ misclassified by $P$, that is, $error_{S,C}(P) = \sum_{s \in S} |P(s) - C(s)|$.

Definition 2. Empirical Error Minimization Problem
An instance is a four-tuple $(\Sigma, S, C, k)$ of an alphabet $\Sigma$, a sample set $S$, an
objective condition $C$ over $S$, and a constant $k \ge 0$. The problem is to find a
k-proximity two-words association pattern $P$ that minimizes $error_{S,C}(P)$.
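
The empirical error can be computed directly from its definition. The Python sketch below again assumes that a sample is a list of `(text, label)` pairs and uses hypothetical function names; it evaluates a single given pattern and does not perform the minimization.

```python
def occurrences(s, t):
    """1-based starting positions of s in t."""
    return [i + 1 for i in range(len(t) - len(s) + 1) if t.startswith(s, i)]


def matches(pattern, t):
    alpha, k, beta = pattern
    return any(0 <= q - p <= k
               for p in occurrences(alpha, t) for q in occurrences(beta, t))


def empirical_error(pattern, sample):
    """Number of documents whose label differs from the pattern's prediction."""
    return sum(abs(int(matches(pattern, s)) - label) for s, label in sample)
```

On the sample `[("abc", 1), ("abd", 1), ("xbc", 0)]`, the pattern `("b", 2, "c")` misclassifies two documents, while `("a", 2, "b")` misclassifies none.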

As we will see in Section 5, the empirical error minimization problem is 
closely related to a learning problem in noisy environments. 

2.3 Suffix Trees 

A suffix tree is a data structure for storing all subwords of a given text in a very
economical way (McCreight). Let $A = a_1 a_2 \cdots a_{n-1}\$$ be a text of length $n$.
We assume that the text always terminates with a special symbol $\$ \notin \Sigma$ distinct
from any letter including itself. For each $1 \le p \le n$, we define the suffix starting
at position $p$ by $A_p = a_p \cdots a_{n-1}\$$.

Then, the suffix tree for text $A$ is exactly the compact trie for all the suffixes
of $A$, that is, the tree obtained from a trie for the suffixes of $A$ by iteratively removing the internal
nodes with only one child and merging the labels of the removed edges.
More precisely, the suffix tree for $A$ is a rooted tree $Tree_A$ that satisfies the
following conditions. (i) Each edge is labeled by a subword $\alpha$ of $A$, which is
encoded by a pair $(p, q)$ of positions that points to an occurrence of $\alpha$ in $A$, that is,
$A[p]A[p+1] \cdots A[q] = \alpha$. (ii) The labels of any two edges leaving from the same
node start with mutually distinct letters. (iii) Each node $v$ represents the string
$Word(v)$ obtained by concatenating the labels on the path from the root to $v$
in this order. (iv) For $1 \le i \le n$, the $i$-th leaf $l_i$ represents the suffix of rank $i$ in
the lexicographic order over all the suffixes of $A$.

Let α be a subword of A. The locus of α in Tree_A, denoted by Locus(α), is
the unique node v of Tree_A such that α is a prefix of Word(v) and Word(w) is
a proper prefix of α, where w is the parent of v.

From (iv) and (iii) above, Tree_A has exactly n leaves and at most n − 1
internal nodes, and thus from (i) it requires O(n) space while representing
O(n^2) subwords of A. Furthermore, McCreight (1976) gives an elegant algorithm
that computes Tree_A in linear time and space. It is known that the average
height of a suffix tree for a random string of length n is O(log n). This is
also the case for genetic sequences.



2.4 Orthogonal Range Query 

Let n be a positive integer. Assume that we are given a finite collection X
of points over a discrete two-dimensional plane [1..n] × [1..n]. An orthogonal
range query is to find all the points in X that are included in a given
rectangle [x_1..x_2] × [y_1..y_2].

Several solutions have been proposed for the problem, and among them we
adopt a method described in Preparata and Shamos (1985) for its simplicity,
although it is not optimal in computation time. Their solution uses a data
structure called the orthogonal range tree that requires O(m log m) space,
O(m log m) preprocessing time, and O(log^2 m) time per query, where m is the
number of points in X. For the algorithm in Section 4, we extend this data
structure to search over the suffix tree.
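
The flavour of such a structure can be sketched in Python as follows. This is a simplified counting variant (a segment tree over the x-order whose nodes keep sorted y-values), not the exact range tree of Preparata and Shamos: it answers how many points fall in a rectangle in O(log^2 m) time, and its naive construction re-sorts at every node, so preprocessing is O(m log^2 m) rather than O(m log m). The class name `RangeCounter` is hypothetical.

```python
from bisect import bisect_left, bisect_right


class RangeCounter:
    """Static 2D orthogonal range counting (a simplified range tree)."""

    def __init__(self, points):
        self.pts = sorted(points)                  # sort by x, then by y
        self.xs = [x for x, _ in self.pts]
        self.size = 1
        while self.size < max(1, len(self.pts)):
            self.size *= 2
        # tree[i] holds the sorted y-values of the x-range covered by node i
        self.tree = [[] for _ in range(2 * self.size)]
        for i, (_, y) in enumerate(self.pts):
            self.tree[self.size + i] = [y]
        for i in range(self.size - 1, 0, -1):
            self.tree[i] = sorted(self.tree[2 * i] + self.tree[2 * i + 1])

    @staticmethod
    def _count_y(ys, y1, y2):
        return bisect_right(ys, y2) - bisect_left(ys, y1)

    def count(self, x1, x2, y1, y2):
        """Number of points (x, y) with x1 <= x <= x2 and y1 <= y <= y2."""
        lo = bisect_left(self.xs, x1) + self.size
        hi = bisect_right(self.xs, x2) + self.size  # half-open leaf range
        total = 0
        while lo < hi:
            if lo & 1:
                total += self._count_y(self.tree[lo], y1, y2)
                lo += 1
            if hi & 1:
                hi -= 1
                total += self._count_y(self.tree[hi], y1, y2)
            lo //= 2
            hi //= 2
        return total
```

Reporting the points themselves, as the query is defined above, would only require collecting the matching slices of the sorted y-lists instead of counting them.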



3 The Mining Algorithm 

In this section, we first show that there exists an efficient algorithm that
computes optimized confidence patterns in time O(mn^2 log^2 n) and space
O(kmn log n) using the suffix tree and the orthogonal range tree as its data
structures. Then, in the next section, we show that we can make orthogonal
range queries directly over the suffix tree instead of the range tree. This
yields a faster algorithm for the optimized confidence pattern problem.

Figure 1 shows our data mining algorithm Find_Optimal, which finds the
optimized confidence patterns. The keys of the algorithm are the steps that
enumerate patterns in canonical form and compute supp(P) and conf(P) quickly.
Let (Σ, S, C, k, σ) be an instance of the optimized confidence pattern problem.






H. Arimura et al. 



Procedure: Find_Optimal

Given: a sample S = {s_1, ..., s_m}, an objective condition C,
       a proximity k > 0 and a minimum support 0 < σ < 1.
Output: the optimized confidence patterns (α, k, β) in canonical form.
Variable: a priority queue Q.

begin
  Q := ∅;
  A := s_1$ ··· $s_m$ and compute doc;
  Build the suffix tree Tree_A and suffix arrays suf, pos;
  Compute Diag_k and Rank_k from A;
  Foreach node v do compute I(v);
  Foreach node u in Tree_A do   /* traversing Tree_A from the root to the leaves */
    Foreach node v in Tree_A do   /* traversing Tree_A from the root to the leaves */
      P := (Word(u), k, Word(v));
      Compute count(P, S_1) and count(P, S_0) by making
        an orthogonal range query I(u) × I(v) for Rank_k;
      Compute supp(P) and conf(P);
      if supp(P) ≥ σ then insert P into the priority queue Q with the key conf(P);
    end;
  end;
  Output all the optimized confidence patterns in Q;
end



Fig. 1. An algorithm for discovering the optimized confidence patterns 



3.1 Enumerating the Patterns in Canonical form Using a Suffix 
Tree 

Let $ ∉ Σ be a symbol such that $ ≠ $. Given S = {s_1, ..., s_m} and
C : S → {0, 1} as input, our algorithm computes a single text
A = s_1$ ··· $s_m$, called the input text, by concatenating all documents in S
delimited with $. Let n = |A|. For each 1 ≤ p ≤ n, we define doc(p) = i if the
i-th text s_i ∈ S includes position p. Without loss of generality, we assume
that there exists some 1 ≤ p ≤ m such that C(s_i) = 1 for all 1 ≤ i ≤ p and
C(s_i) = 0 for all p < i ≤ m.
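
The construction of the input text and of doc can be sketched in Python as follows. The function name `build_input_text` is hypothetical, positions are 1-based as in the text, and the sketch assumes that each delimiter position is assigned to the document it terminates. Note that an ordinary "$" character compares equal to itself, unlike the special symbol assumed above, so code built on this sketch has to treat delimiters separately.

```python
def build_input_text(sample, delimiter="$"):
    """Concatenate the documents as s_1$...$s_m$ and compute doc.

    Returns (text, doc), where doc[p] is the 1-based index of the document
    that covers the 1-based position p; doc[0] is unused padding.
    """
    parts = []
    doc = [0]
    for i, s in enumerate(sample, start=1):
        if delimiter in s:
            raise ValueError("delimiter must not occur in a document")
        parts.append(s + delimiter)
        # the delimiter position is assigned to document i (an assumption)
        doc.extend([i] * (len(s) + 1))
    return "".join(parts), doc
```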

Next, we build the suffix tree Tree_A for the input text A by using a linear
time algorithm. This tree Tree_A is isomorphic to the generalized suffix tree
(GST), that is, the compacted trie for all the suffices of documents in S,
except for the labels of the edges directed to the leaves (Amir et al.). Then,
for each leaf v, we redefine Word(v) as the longest prefix of the suffix
represented by v that contains no $'s. This is actually a standard method to
build a GST in linear time.

Now, we introduce an equivalence relation ≡_A as follows. For strings α, β,
α ≡_A β iff Occ_A(α) = Occ_A(β). For k-proximity two-words association patterns
P = (α_1, k, α_2) and Q = (β_1, k, β_2), P ≡_A Q iff α_1 ≡_A β_1 and
α_2 ≡_A β_2. If P ≡_A Q then we say that P and Q are equivalent.
Lemma 1. Equivalent patterns give the same value for supp_S(P) and conf_S(P).

Proof. By definition, equivalent patterns P, Q have the same set of
occurrences in the text A. Thus, count(P, T) = count(Q, T) for any subset
T ⊆ S. Since supp_S(P) and conf_S(P) are defined through count, the result
follows. □

A pattern is said to be in canonical form if it has the form (Word(u),
k, Word(v)) for some nodes u, v of Tree_A. By definition, the number of
k-proximity two-words association patterns is O(n^4). Let ⊥_S ∈ Σ* be an
arbitrary string whose length is max{ |s| : s ∈ S } + 1. Clearly, ⊥_S occurs in
no document of S.

Lemma 2. For any k-proximity two-words association pattern P, if P matches
a document in S then there exists an equivalent pattern in canonical form.

Proof. Let P = (α, k, β). We show that for any subword α of A, there exists a
node w of Tree_A such that α ≡_A Word(w) as follows. Consider the uncompacted
suffix trie for A. Then, there exists a node v of this trie that represents α.
Let w be the highest descendant of v that has at least two children. Now, we
map the nodes of the uncompacted trie into those of Tree_A in the standard
way. Then, we can easily see that v and w are mapped onto the same edge in the
compacted version Tree_A. We know that subtree(v) and subtree(w) have the
same set of leaves, and thus they have the same set of occurrences in A. Since
w = Locus(α) in Tree_A, we have Word(w) ≡_A α. Hence, the lemma follows. □



Lemma 3. For any optimized confidence pattern among k-proximity two-words
association patterns, there is an equivalent pattern P that is either in
canonical form or ⊥_S.

3.2 Computing the Support Values Using Range Queries 

In this section, we show how to quickly compute the support and the confidence 
by using orthogonal range queries. The technique used here is basically due to 
Manber and Baeza-Yates. The idea is to reduce the discovery of optimized
confidence patterns to the discovery of 2-dimensional axis-parallel rectangles
over the space consisting of the ranks of the lexicographically ordered suffices. 

Let A_{p_1}, A_{p_2}, ..., A_{p_n} be the sequence obtained by arranging all the
suffices of A in the lexicographic order over Σ*. Here, A_p is the suffix of A
starting at position p. Then, we store the indexes p_1, p_2, ..., p_n in an
array suf : [1..n] → [1..n] of length n in this order, and define the array
pos : [1..n] → [1..n] as the inverse mapping of suf. These arrays are called
the suffix array (Gonnet and Baeza-Yates; Manber and Myers). By definition,
suf(i) is the position of the suffix of rank i and pos(p) is the rank of the
suffix A_p.
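
A minimal Python sketch of the two arrays is given below. It sorts the suffixes naively, which takes O(n^2 log n) time in the worst case and is therefore only an illustration, not the linear-time construction assumed in the analysis. The function name `suffix_arrays` is hypothetical, and index 0 of both arrays is unused so that positions and ranks stay 1-based as in the text.

```python
def suffix_arrays(text):
    """Return (suf, pos) for text, both 1-based (index 0 is padding).

    suf[i] is the start position of the suffix of rank i, and pos[p] is the
    rank of the suffix starting at position p.
    """
    n = len(text)
    order = sorted(range(1, n + 1), key=lambda p: text[p - 1:])
    suf = [0] + order
    pos = [0] * (n + 1)
    for rank in range(1, n + 1):
        pos[suf[rank]] = rank
    return suf, pos
```

With Python's default character order, "$" sorts before the letters, so for the text "abab$" the suffix "$" receives rank 1.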

First, we observe that for any node v of Tree_A, the set of the occurrences of
Word(v) occupies a contiguous subinterval I(v) = [x_1..x_2] in array suf.
Lemma 4. The subintervals I(v) for all nodes v are computed in O(n) time.

Proof. I(v) is computed by traversing Tree_A from the leaves to the root in,
e.g., depth-first search. Scanning the leaves from left to right, let
I(l_i) = [i..i] for the i-th leaf l_i (i = 1, ..., n). Then, traversing Tree_A
from the leaves to the root, we visit each internal node v with its children
v_1, ..., v_h and let I(v) = [L..R], where L and R are the left boundary of
I(v_1) and the right boundary of I(v_h). □
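
The traversal in this proof can be sketched in Python on an abstract ordered tree. The representation (a dictionary `children` mapping each internal node to its ordered list of children) and the function name `compute_intervals` are assumptions made for illustration; building the suffix tree itself is not shown.

```python
def compute_intervals(children, root):
    """Compute I(v) = (L, R) for every node of an ordered tree.

    children maps each internal node to its ordered list of children; nodes
    without an entry are leaves. Leaves are numbered 1, 2, ... left to right.
    """
    interval = {}
    next_rank = 1
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        kids = children.get(node, [])
        if not kids:
            interval[node] = (next_rank, next_rank)   # the i-th leaf gets [i..i]
            next_rank += 1
        elif children_done:
            # left boundary of the first child, right boundary of the last
            interval[node] = (interval[kids[0]][0], interval[kids[-1]][1])
        else:
            stack.append((node, True))
            stack.extend((child, False) for child in reversed(kids))
    return interval
```

Each node is pushed at most twice, so the running time is linear in the number of nodes, in line with the O(n) bound of the lemma.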

Secondly, the algorithm builds from S and C the diagonal set of width k,
that is, the set of 2-dimensional labeled points

Diag_k = { (p, q; doc(p)) | 1 ≤ p, q ≤ n, 0 ≤ (q − p) ≤ k, doc(p) = doc(q) }.

Then, the algorithm transforms Diag_k into the following set of 2-dimensional
labeled points:

Rank_k = { (pos(p), pos(q); i) | (p, q; i) ∈ Diag_k }.

For a k-proximity two-words association pattern P = (α, k, β), we associate
a 2-dimensional axis-parallel rectangle Rect(P) = I(α) × I(β), and define
count(Rect(P), Q) to be the number of distinct i such that
(x, y; i) ∈ Q ∩ (Rect(P) × [1..m]), where Q is any subset of Rank_k.

Lemma 5. For any k-proximity two-words association pattern P, count(P, S) =
count(Rect(P), Rank_k).
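
The statement of Lemma 5 can be checked on small inputs with the following brute-force Python sketch. It assumes the reading 0 ≤ q − p ≤ k of the diagonal set, uses a naive suffix sort, finds I(α) by scanning the sorted suffixes rather than by a suffix tree, and counts the points in the rectangle by a linear scan instead of a range query. All function names are hypothetical.

```python
def rank_points(sample, k):
    """Build the text, the array suf, and the point set Rank_k (1-based)."""
    text = "".join(s + "$" for s in sample)
    n = len(text)
    doc = [0]
    for i, s in enumerate(sample, start=1):
        doc.extend([i] * (len(s) + 1))
    suf = [0] + sorted(range(1, n + 1), key=lambda p: text[p - 1:])
    pos = [0] * (n + 1)
    for rank in range(1, n + 1):
        pos[suf[rank]] = rank
    rank_k = {(pos[p], pos[q], doc[p])
              for p in range(1, n + 1)
              for q in range(p, min(p + k, n) + 1)
              if doc[p] == doc[q]}
    return text, suf, rank_k


def interval(text, suf, word):
    """Ranks of the suffixes that start with word, as (first, last) or None."""
    ranks = [r for r in range(1, len(suf))
             if text.startswith(word, suf[r] - 1)]
    return (ranks[0], ranks[-1]) if ranks else None


def count_rect(sample, k, alpha, beta):
    """Number of distinct document labels among Rank_k points in Rect(P)."""
    text, suf, rank_k = rank_points(sample, k)
    ia, ib = interval(text, suf, alpha), interval(text, suf, beta)
    if ia is None or ib is None:
        return 0
    return len({i for x, y, i in rank_k
                if ia[0] <= x <= ia[1] and ib[0] <= y <= ib[1]})
```

For the sample ["abcab", "bbb", "acb"] and the pattern ("a", 2, "b"), the rectangle contains points labeled with documents 1 and 3, which are exactly the documents in which the pattern occurs.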

From Lemma 5, the problem of computing supp_S(P) and conf_S(P) reduces
to the problem of computing count(Rect(P), Rank_k). A standard argument
shows the following lemma.

Lemma 6. Let l be any integer, Q ⊆ [1..l]^2 × [1..m] be a set of 2-dimensional
labeled points and R ⊆ [1..l]^2 be any rectangle. Then, count(R, Q) is
computable in time O(m log^2 n) and space O(mn log n) with preprocessing time
O(n log n), where n is the number of points in Q.

Theorem 1. Let (Σ, S, C, k, σ) be an instance of the optimized confidence
pattern problem. Then, the algorithm Find_Optimal in Figure 1 computes all the
optimized confidence patterns in canonical form with proximity k and support
threshold σ in time O(mn^2 log^2 n) and space O(kmn log n), where m = card(S)
and n = size(S).

Proof. First we build the suffix tree Tree_A in linear time and space. Then,
we compute the intervals I(v) for all nodes v in time O(n) with dynamic
programming (Preparata and Shamos 1985). From the lemmas of Section 3.1, it
suffices to search at most O(n^2) canonical patterns
P = (Word(u), k, Word(v)) by enumerating the pairs u, v of nodes of Tree_A.
Then, we can see from Lemma 5 and Lemma 6 that for each P, we can compute
supp(P) and conf(P) in O(kmn log n) preprocessing time, O(kmn log n) space,
and O(m log^2 n) time per query. Note that the number of points in Rank_k is
at most kn. Since the number of possible patterns in canonical form is O(n^2),
this proves the result. □
4 Modified Algorithm

In this section, we present a modified version of our algorithm that runs in
time O(max{k, m} n^2) and space O(kn). In the algorithm, we implement an
orthogonal range query mechanism over the suffix tree itself instead of a range
tree and compute the value of count(Rect(P), Rank_k) by dynamic programming in
a similar way as Maass. Figure 2 shows the modified version of our algorithm.

Procedure: Modified_Find_Optimal

Given: a sample S = {s_1, ..., s_m}, an objective condition C, constants k > 0,
       and 0 < σ < 1.
Output: the optimized confidence patterns (α, k, β) in canonical form.
Variable: a priority queue Q;

begin
 1  Compute A = s_1$ ··· $s_m$, and doc; Q := ∅;
 2  Build the suffix tree Tree_A and suffix arrays suf, pos;
 3  Compute Diag_k and Rank_k from S;
 4  Foreach node u in Tree_A traversing from the leaves to the root do
 5    if u is the x-th leaf l_x then
 6      B(l_x) := { (y, z) | (x, y, z) ∈ Rank_k, ∃y, z };
 7    if u is an internal node with children u_1, ..., u_h then
 8      B(u) := B(u_1) ∪ ··· ∪ B(u_h), and then discard B(u_1), ..., B(u_h);
 9    Foreach node v in Tree_A traversing from the leaves to the root do
10      if v is the y-th leaf l_y then
11        C(l_y) := { (z) | (y, z) ∈ B(u), ∃z };
12      if v is an internal node with children v_1, ..., v_h then
13        C(v) := C(v_1) ∪ ··· ∪ C(v_h), and then discard C(v_1), ..., C(v_h);
14      P := (Word(u), k, Word(v));
15      Compute count(P, S_1) and count(P, S_0) from C(v);
16      if supp(P) ≥ σ then
17        Insert P into Q with the key conf(P);
18    end;
19  end;
20  Output all the optimized confidence patterns in Q;
end



Fig. 2. A modified algorithm for discovering the optimized confidence patterns 

Each node v of Tree_A has linear lists B(v) and C(v). Elements of B(v) are
pairs (y, z) ∈ [1..n] × [1..m] and are sorted by the y-coordinate. Similarly,
elements of C(v) are labels (z) ∈ [1..m] and are sorted by z.
For any x ∈ [1..n], we denote by l_x the x-th leaf.

Lemma 7. Suppose the algorithm Modified_Find_Optimal visits node u in the
loop at Line 4, visits node v in the loop at Line 9, and reaches Line 15. Then,
C(v) is the ordered list of the z's such that
(x, y; z) ∈ Rank_k ∩ (Rect(P) × [1..m]).



Proof. Proved by induction on the heights of the nodes u and v in a similar
way as in Lemma 4. □
Theorem 2. Let (Σ, S, C, k, σ) be an instance of the optimized confidence
pattern problem. Then, the algorithm Modified_Find_Optimal in Figure 2 computes
all the optimized confidence patterns in canonical form with proximity k and
support threshold σ in time O(max{k, m} n^2) and space O(kn), where
m = card(S) and n = size(S).

Proof. The correctness follows from the lemmas above. Then, we estimate the
running time. Line 1 to Line 3 take time O(n + kn). Since Tree_A contains O(n)
nodes, the outer loop at Line 4 is executed O(n) times, and thus the inner loop
at Line 9 is executed O(n^2) times. At each visit to a node u in the outer
loop, each merge operation at Line 8 takes time at most O(kn) because the
length of the resulting list B(u) is obviously bounded by kn = card(Rank_k).
Line 6 takes total time O(kn) in scanning all leaves. Similarly, at each visit
to a node v in the inner loop, each merge operation at Line 13 takes time
O(hm) because each C(v_i) is sorted and its length is bounded by m, where h is
the number of children. Since the sum of the numbers of children over all
nodes is O(n), the amortized total time of Line 13 is O(mn) and that of Line 11
is O(kn) in scanning all leaves for a fixed u. Combining these observations,
the total running time of the algorithm is
T(n) = O(kn) + O(kn) + n · (O(kn) + O(mn)) = O(max{k, m} n^2).

□ 
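
The two list operations that dominate this analysis, the merge at Lines 8 and 13 and the counting at Line 15, can be sketched in Python as follows. The sketch uses `heapq.merge`, which merges h sorted lists in O(N log h) time for N elements in total, so it is a stand-in rather than the O(hm) merge used in the proof. The function names are hypothetical, and the counting step relies on the earlier assumption that the positive documents are exactly those numbered 1, ..., p.

```python
from bisect import bisect_right
from heapq import merge


def merge_sorted_lists(lists):
    """Union of sorted lists, returned as one sorted list without duplicates."""
    merged = []
    for item in merge(*lists):
        if not merged or merged[-1] != item:
            merged.append(item)
    return merged


def split_counts(labels, num_positive):
    """Count the labels z <= num_positive and the labels z > num_positive.

    labels is a sorted, duplicate-free list of document numbers such as C(v).
    """
    positives = bisect_right(labels, num_positive)
    return positives, len(labels) - positives
```

For instance, merging the lists [1, 3, 5], [2, 3] and [5, 6] yields [1, 2, 3, 5, 6], and with three positive documents the list [1, 2, 5, 6] splits into two positive and two negative labels.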

It is known that the height of the suffix tree is O(log n) for random texts
generated by a uniform distribution. By Theorem 3 below, we expect an
algorithm to run very efficiently in time O(kn log^3 n) with high probability
for such random texts.

Theorem 3. Let (Σ, S, C, k, σ) be an instance of the optimized confidence
pattern problem. Then, there exists an algorithm that computes all the
optimized confidence patterns in canonical form with proximity k and support
threshold σ in time O(kd^2 n log n) and space O(kn), where n = size(S) and d
is the height of the suffix tree.

Proof. We observe that each layer, the set of nodes at the same level, of a
suffix tree contains N = kn elements in total in the lists B(u) (and C(v)). In
the modified algorithm, we attach to each node a pointer to its parent and
sorted lists C(v) and B(u) of points (y, z) and (x, y, z). We use a balanced
tree to represent each list C(v) so that we can perform both the insertion of
an element and the counting of positive-labeled or negative-labeled elements
in time O(log l) for l = card(C(v)). By using these, we traverse the tree
levelwise from the leaves to the root. Thus, at each node u with list B(u), we
can compute all C(v) in time O(d n_0 log n_0) from Line 10 to Line 17 in
Figure 2, where n_0 = card(B(u)). Repeatedly using this idea, we can derive the
time O(kd^2 n log n) in k, d, n. We omit the details. □

5 Agnostic PAC-Learning and the Empirical Error 
Minimization 

Agnostic PAC-learning is a generalization of the well-known PAC-learning model
in computational learning theory, which is intended to capture learning
situations in the real world. In agnostic PAC-learning, a learning algorithm
must work robustly in noisy environments, and even have the capability to
approximate arbitrary unknown probability distributions.

Haussler and Kearns et al. showed a close link between agnostic
PAC-learning and the empirical error minimization problem defined above. The
Vapnik-Chervonenkis dimension (VC-dimension) is a measure of the complexity
of a concept class. The class of k-proximity two-words association patterns
obviously has polynomial VC-dimension for any fixed k > 0.

Lemma 8 (Kearns et al.). For any hypothesis class with polynomial
VC-dimension, the polynomial time solvability of the empirical error
minimization problem and the efficient agnostic PAC-learnability are equivalent.

Now, we show that we can modify the algorithm Modified_Find_Optimal to
solve the empirical error minimization problem in the same time and space
complexity as the original version.

Theorem 4. Let S be a sample, C be an objective condition over S, and k > 0 be
a fixed constant. Then, there exists an algorithm that solves the empirical
error minimization problem for k-proximity two-words association patterns in
time O(kd^2 n log n) and space O(kn), where n = size(S) and d is the height of
the suffix tree.

Proof. The empirical error minimization is equivalent to the maximization of
the difference Δ_{S,C}(P) = count(P, S_1) − count(P, S_0). It is not hard to
see that Δ_{S,C} is maximized by either a pattern in canonical form or ⊥_S, as
in Lemma 3. Now, we modify the algorithm Modified_Find_Optimal in Figure 2 to
compute a pattern P that maximizes Δ_{S,C}(P). At Line 15 of the algorithm, we
compute the quantities count(P, S_1) and count(P, S_0), and then compute
Δ_{S,C}(P). We skip the test at Line 16, and then at Line 17 we insert P into
the priority queue Q with Δ_{S,C}(P) as the key. This modification works
correctly, and hence the theorem follows from Theorem 3. □
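
The equivalence used at the start of this proof can be spelled out as a short derivation. It assumes, as in the algorithms above, that S_1 and S_0 denote the documents of S with C(s) = 1 and C(s) = 0 respectively, and that count(P, T) is the number of documents in T in which P occurs.

```latex
\begin{align*}
\mathrm{error}_{S,C}(P)
  &= \sum_{s \in S} \lvert P(s) - C(s) \rvert \\
  &= \bigl(\lvert S_1 \rvert - \mathrm{count}(P, S_1)\bigr)
     + \mathrm{count}(P, S_0) \\
  &= \lvert S_1 \rvert - \Delta_{S,C}(P).
\end{align*}
```

Since |S_1| does not depend on P, a pattern minimizes the empirical error exactly when it maximizes Δ_{S,C}(P).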

Although the empirical error minimization problem is intractable for most
concept classes, recently some geometrical patterns, e.g. axis-parallel
rectangles and convex k-gons on the Euclidean plane, have been shown to be
efficiently agnostic PAC-learnable. From Lemma 8 and Theorem 4 we know that
there exists an agnostic learning algorithm that runs in time O(kn log^3 n) if
an input sample is ensured to be random. Hence, our result is one of a few
results on efficient agnostic learnability for nontrivial concept classes
other than geometric patterns.

6 Pruning and Sampling 

Pruning: Based on the monotonicity of the support of patterns in canonical
form (W(u), k, W(v)),

- If u is a parent of v then supp_S(W(u)) ≥ supp_S(W(v)),

- If min{supp_S(W(u)), supp_S(W(v))} < σ then supp_S((W(u), k, W(v))) < σ,

we incorporate two pruning heuristics in the first algorithm: (1) Local
pruning. Prune the descendants of u if supp_S(W(u)) < σ at some u. (2) Global
pruning. Prune the descendants of v if supp_S((W(u), k, W(v))) < σ, where
supp(α) is the support of a subword α. The local pruning is also possible in
the second algorithm. By a similar argument to the proof of Theorem 3, we know
that there are at most kd^2 n canonical patterns of nonzero support for the
height d of the suffix tree. Thus, we can expect that the efficiency of the
first algorithm is improved with pruning for nearly random texts.

Sampling: The modified algorithm in Section 4 achieves O(mn^2) time but it
is not fast enough to be applied to huge text databases of several gigabytes.
The following procedure approximates the solutions by using a random sampling
technique.



Given: a sample S consisting of n examples.

begin
  Draw m documents from S according to the uniform distribution.
  Let S_m be the obtained sample.
  By using algorithm Find_Optimal, compute the optimized
  confidence patterns P with respect to S_m, and output P.
end

We choose the sample size m so that the algorithm works in almost linear time
in n. The patterns computed by random sampling may give lower confidence than
the patterns obtained from the original sample S. Therefore, we present
empirical evaluations of the sampling heuristics by experiments.
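
The sampling procedure can be sketched in Python as a thin wrapper around any mining routine. The function name `sample_and_mine` and the callable parameter `find_optimal` are hypothetical placeholders for an implementation of Find_Optimal, and the sketch assumes that the m documents are drawn uniformly without replacement, which the procedure above does not specify.

```python
import random


def sample_and_mine(documents, labels, m, find_optimal, seed=None):
    """Run find_optimal on m documents drawn uniformly at random.

    find_optimal is any callable taking (documents, labels); it stands in
    for the mining algorithm and is not defined here.
    """
    rng = random.Random(seed)
    chosen = rng.sample(range(len(documents)), m)   # without replacement
    sub_documents = [documents[i] for i in chosen]
    sub_labels = [labels[i] for i in chosen]
    return find_optimal(sub_documents, sub_labels)
```

Passing a fixed `seed` makes a trial reproducible, which is convenient when averaging the confidence over repeated trials.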



7 Experimental Results 



We ran experiments on genetic data to evaluate the efficiency and the
performance of our algorithms. The program was written in C based on the second
algorithm in Section 4 and run on a Sun Ultra 1 workstation under the Sun
Solaris 2.5 operating system. The data were amino acid sequences of 24KB in
total from the GenBank database. We obtained 450 positive sequences related to
the signal peptide and 450 negative sequences from other sequences, and
preprocessed the data by transforming the twenty amino acids into three symbols
0, 1, 2 by the indices due to Kyte and Doolittle (1982). For each m = 10 ~ 100
and each trial, we randomly drew m positive and m negative sequences from the
original sample S, and computed the optimum patterns for the obtained sample.
Fig. 4. The performance of sampling (vertical axis: confidence)



Running time. Figure 3 shows the dependency of the running time on the
number m of documents, which is proportional to the sample size n. We see that
the growth of the running time is slower than we expected from Theorem 2.
For a larger sample of 24KB (m = 450), the algorithm takes one minute and over
twenty megabytes of main memory to find 37 optimal patterns, testing 676
locally frequent patterns among 577296729 possible patterns. Examples of the
optimal patterns are "12222" * "2" and "0" * "2222", which achieve high
confidence of 69% and 66% with support around 65%.

Random sampling. Figure 4 shows the performance of the sampling heuristics
for varying sample size. For each trial, we first compute the best ten patterns
for the sample S_m and then evaluate the empirical confidence conf on S_m and
the real confidence conf* on S. We plotted the average of the confidence
through 100 trials for each m. The parameters were k = 5 and σ = 0.6. In
Figure 4, we can see that the error is 5% for the confidence by 10% sampling.
For the support, we had a similar result (12% error for the support).

Table 1. The performance of pruning heuristics

σ      Sample size (bytes)   Candidates   Locally frequent   Globally frequent
90%    2,701                 7,295,401    121                40
50%    2,754                 7,584,516    1,444              64
30%    2,764                 7,639,696    6,241              88

Pruning. Table 1 shows the efficiency of the pruning heuristics. The sample
consists of 50 positive and 50 negative sequences of 2.7KB in total. The first
and the second columns show the minimum support and the sample size, and
the third, the fourth, and the fifth columns show the number of candidate
patterns that remain initially, after local pruning, and after global pruning.
We can see that only a small fraction of patterns can be solutions for nearly
random sequences.
8 Conclusion

In this paper, we considered the problem of finding association rules over
subwords, and gave efficient algorithms that compute optimized confidence rules
from a large collection of unstructured text data for the class of two-words
association patterns. We discussed the performance of search heuristics and ran
experiments on genetic data.

The algorithms presented in this paper extensively use the suffix array data
structure, which is a variant of the suffix tree that is more space efficient
and suitable for implementing advanced search functions in large text databases
(Gonnet and Baeza-Yates; Manber and Myers). Thus, it is a future problem
to develop scalable implementation techniques over suffix arrays that enable us
to incorporate our algorithms into existing text databases. The study of
secondary storage directed algorithms and the speed-up by parallel execution
are other future problems.

We have also developed efficient learning algorithms for classes of structured
patterns. It will be interesting to extend the framework of this paper to
structured patterns.
Abstract. In this paper we investigate inductive inference of recursive
real-valued functions from data. A recursive real-valued function is
regarded as a computable interval mapping, which has been introduced by
Hirowatari and Arikawa (1997), and modified by Apsitis et al. (1998).
The learning model we consider in this paper is an extension of
Gold's inductive inference. We first introduce some criteria for successful
inductive inference of recursive real-valued functions. Then we show a
recursively enumerable class of recursive real-valued functions which is
not inferable in the limit. This should be an interesting contrast to the
result by Wiehagen (1976) that every recursively enumerable subset of
recursive functions from N to N is consistently inferable in the limit. We
also show that every recursively enumerable class of recursive real-valued
functions on a fixed rational interval is consistently inferable in the limit.
Furthermore we show that our consistent inductive inference coincides
with the ordinary inductive inference, when we deal with recursive real-valued
functions on a fixed closed rational interval.



1 Introduction 

This paper investigates inductive inference of real-valued functions from
examples. Examples of real-valued functions obtained by experiments and
observations are numerical data which inevitably involve some ranges of errors.
Hence such numerical data are represented by pairs of rational numbers
approximating an exact value and an error bound respectively. Each of the
numerical data can also be represented as a pair of upper and lower bounds
on the exact value. Thus it is regarded as an interval number, i.e. a closed
interval containing the exact value. Hence it is reasonable to regard
real-valued functions as computable interval mappings.
Recursive real-valued functions as interval mappings have been introduced by
Hirowatari and Arikawa (1997), and then modified by Apsitis et al. (1998), in
order to study learning of real-valued functions. On the other hand, computable
real functions were introduced by Grzegorczyk, and several other formulations,
different from but equivalent to his, have been reported. The
formulations of computable real functions are convenient for discussing the
computational complexity of such functions. However, they are not always
suitable for learning or inductive inference from examples. Our functions are
implemented by recursive mappings of intervals, in contrast with the computable
real functions. Furthermore, our functions can enjoy the merits of not only
computable interval functions but also computable real functions. Every partial
computable real function is a recursive real-valued function. These are the
reasons why our functions are suitable for algorithmic learning.

Elsewhere, two approaches were exhibited for the learning of real-valued
functions from examples by using computable analytic functions and arbitrary
computable functions of recursive real numbers respectively. It was proved
there that the set of continuous functions defined over an interval is
learnable if and only if the interval is closed on both ends, and that the same
is true for monotonic functions. Furthermore, Haussler considered the problem
of learning functions from X into Y, as a generalization of the PAC learning
model. In his model, the learner receives randomly drawn examples
(x, h_0(x)) ∈ X × Y for some unknown target function h_0, and tries to find a
decision rule h : X → A, in order to minimize the expectation of a loss
l(y, a), where X, Y and A are arbitrary sets of reals, and l is a real-valued
function. Our learning model, first presented in earlier work, differs from
these models.

Our model is an extension of the Gold model of inductive inference (called
REALEX-inference) to handle inference of real-valued functions. This is a
process of hypothesizing recursive real-valued functions intended to explain
the received numerical data. An inference machine requests input data from
time to time, and identifies an algorithm which computes the target function
in the limit. As we deal with real-valued functions as target functions, we
need to consider the precision of the guesses from the inference machine. For
this purpose, we introduce the notion of consistent inductive inference of
recursive real-valued functions (called REALCONS-inference) as a successful
identification criterion. We have previously shown that every recursively
enumerable class of recursive real-valued functions on a fixed rational
interval is REALCONS-inferable in the limit.

In this paper we first propose some criteria for successful inference, and
compare the classes resulting from the criterion of REALCONS, that is, the set
of all consistently inferable classes of recursive real-valued functions. Then
we show that REALCONS is properly included in REALEX, and show that REALNUM,
the collection of all recursively enumerable sets of recursive real-valued
functions, is not included in REALCONS. REALNUM is not included in REALEX,
although Wiehagen showed that every recursively enumerable subset of recursive
functions from N to N is consistently inferable in the limit. We then
consider inferability of recursive real-valued functions on a fixed domain. We
will focus our attention on functions on a rational interval. In the context of
inference of recursive real-valued functions on a fixed open or half-open
rational interval, we show that REALNUM is properly included in REALCONS, which
makes an interesting contrast with the above result that REALNUM is not
included in REALEX. Furthermore, we show that REALCONS is properly included
in REALEX, in the context of inference of recursive real-valued functions on
a fixed open or half-open rational interval. On the other hand, we show that
REALCONS coincides with REALEX, in the context of inference of recursive
real-valued functions on a fixed closed rational interval.



2 Recursive Real-Valued Functions

Let N, Q and R be the sets of all natural numbers, rational numbers and real 
numbers respectively. By N+ and Q+ we denote the sets of all positive natural
numbers and rational numbers respectively. 

A recursive real number is a pair of two sequences of rational numbers con- 
verging to the number and rational numbers converging to zero. 

Definition 1. Let f be a function from N to Q, and g be a function from N to
Q+. A pair (f, g) is an approximate expression of a real number x, if f and g
satisfy the following conditions:

1. lim_{n→∞} g(n) = 0.

2. |f(n) − x| ≤ g(n) for any n.

The number x is recursive real, if there is an approximate expression (f, g)
of x such that f and g are recursive.

f(n) and g(n) show an approximate value of the real number and an error
bound at each point respectively.
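
As a concrete instance, the following Python sketch gives one possible approximate expression of the square root of 2, using exact rational arithmetic. The choice of bisection on [1, 2] and the name `sqrt2_expression` are illustrative assumptions; any recursive f and g meeting the two conditions of Definition 1 would do.

```python
from fractions import Fraction


def sqrt2_expression():
    """Return (f, g), an approximate expression of the square root of 2."""

    def f(n):
        # n bisection steps on [1, 2]; the lower end never exceeds sqrt(2)
        lo, hi = Fraction(1), Fraction(2)
        for _ in range(n):
            mid = (lo + hi) / 2
            if mid * mid <= 2:
                lo = mid
            else:
                hi = mid
        return lo

    def g(n):
        # width of the bracketing interval after n steps; tends to zero
        return Fraction(1, 2 ** n)

    return f, g
```

After n steps the square root of 2 lies in [f(n), f(n) + g(n)], so |f(n) − x| ≤ g(n) holds and g(n) converges to zero, as the definition requires.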

In this section we propose recursive real-valued functions which are closely
related with computable real functions. They are implemented via
recursive mappings of intervals.

By a rational interval we mean an interval whose end points are rational. We
sometimes call it just an interval when no confusion occurs. Let h : S → R be a
real-valued function, where S ⊆ R is the domain of h. Given S, we introduce a
collection of rational intervals Dom_S ⊆ Q × Q+ which contains all sufficiently
short intervals contained in S. Given Dom_S, we also introduce a function
A_h : Dom_S → Q × Q+ which maps rational intervals (p, α) ∈ Dom_S to rational
intervals showing where the value h(x) is, provided that x ∈ [p − α, p + α].
The rationalized function A_h maps short intervals to short intervals, so that
h can be computed with arbitrary precision.

Definition 2. Let S ⊆ R be a domain of some function. We say that Dom_S ⊆
Q × Q+ is a rationalized domain of S, if it satisfies the following conditions:

1. Every interval in Dom_S is contained in S: If (p, α) ∈ Dom_S, then
   [p − α, p + α] ⊆ S.

2. Dom_S covers the whole S: For any x ∈ S there is (p, α) ∈ Dom_S such that
   x ∈ [p − α, p + α]. Especially, if x ∈ S is an interior point, then there is
   (p, α) ∈ Dom_S such that x ∈ (p − α, p + α).

3. Dom_S is closed under subintervals: If (p, α) ∈ Dom_S and
   [q − β, q + β] ⊆ [p − α, p + α], then (q, β) ∈ Dom_S.

There exists a rationalized domain Dom_S if and only if S can be expressed as
a union of closed rational intervals. The same S can have different
rationalized domains Dom_S.

Definition 3. Let $h : S \to R$ be a real-valued function, and let S have a
rationalized domain $Dom_S$. A rationalized function of h, denoted by $A_h$, is a
computable function from $Dom_S$ to $Q \times Q^+$ which satisfies the following
condition:

For any $x \in S$ and any approximate expression (f, g) of the number x,
there exists an approximate expression $(f_0, g_0)$ of the number h(x) such
that for all $n \in N$, $(f(n), g(n)) \in Dom_S$ implies
$A_h((f(n), g(n))) = (f_0(n), g_0(n))$.

If f and g are recursive, then there exist recursive functions $f_0$ and $g_0$
satisfying the above. Thus the function h above satisfies the condition that
h(x) is recursive real for any recursive real $x \in S$.

Definition 4. Let $h : S \to R$ be a real-valued function. Then h is said to be
a recursive real-valued function, if there exists a rationalized domain $Dom_S$
of S, and a rationalized function $A_h : Dom_S \to Q \times Q^+$ of h. We demand
that $A_h((p, \alpha))$ does not halt for all $(p, \alpha) \notin Dom_S$.

From these definitions we can design $A_h$ as the following algorithm: For h,
$A_h$ takes a pair $(p, \alpha) \in Q \times Q^+$ as an input, and produces
$A_h((p, \alpha))$ and stops if $(p, \alpha) \in Dom_S$, else it never halts. Thus we
sometimes say that $A_h$ is an algorithm which computes h.
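
To make the idea of a rationalized function concrete, here is a minimal Python
sketch for the squaring function h(x) = x^2 on S = R, where every pair
$(p, \alpha)$ with $\alpha > 0$ can be taken to lie in $Dom_S$. The name `A_square`
and the use of an exception as a stand-in for non-termination are assumptions
of this example, not constructions from the paper.

```python
from fractions import Fraction


def A_square(p, alpha):
    """Map the interval [p - alpha, p + alpha] to a pair (q, beta) enclosing x*x."""
    if alpha <= 0:
        # Stand-in for "never halts" on inputs outside Dom_S (assumption of this sketch).
        raise ValueError("(p, alpha) is not in Dom_S")
    q = p * p
    # For x in [p - alpha, p + alpha]:
    # |x*x - p*p| = |x - p| * |x + p| <= alpha * (2*|p| + alpha).
    beta = alpha * (2 * abs(p) + alpha)
    return q, beta
```

Short input intervals yield short output intervals, since beta shrinks with
alpha for bounded p; this is the property that lets h be computed to arbitrary
precision.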

A real-valued function $h : S \to R$ is computable if h(x) is recursive real for
any recursive real $x \in S$ and there exists an efficient procedure to find h(x)
from the given x. Thus our recursive real-valued functions are computable.
Furthermore the recursive real-valued functions satisfy the conditions required
in interval analysis.

Let h be a recursive real-valued function, and $A_h$ be the algorithm that
computes h. By $A_h((p, \alpha))$, we denote the output of the algorithm for an
input $(p, \alpha)$.

By $\varphi_j$ we denote a partial recursive function from N to N computed by a
program j. Thus the set $\mathcal{P} = \{\varphi_0, \varphi_1, \varphi_2, \ldots\}$ is
the set of all partial recursive functions from N to N. By $\Phi_j(i)$, we denote
the number of steps needed to compute $\varphi_j(i)$ when program j receives
input i. For this set $\{\varphi_0, \varphi_1, \varphi_2, \ldots\}$, the following
recursion theorem holds:

Theorem 1. For any recursive function h from N to N, there exists a number
$i \in N$ such that $\varphi_{h(i)} = \varphi_i$.

We can extend $\varphi_j \in \mathcal{P}$ to a stair function defined below, and
then treat $\varphi_j$ as a kind of recursive real-valued function.

Definition 5. Let $\varphi_j \in \mathcal{P}$ and let $S_0$ be the domain of
$\varphi_j$. A function $h : S \to R$ is a stair function of $\varphi_j$, if h
satisfies the conditions:

(1) $S = \bigcup_{i \in S_0} (i - \frac{1}{2}, i + \frac{1}{2})$.

(2) $h(x) = \varphi_j(i)$ for any $x \in (i - \frac{1}{2}, i + \frac{1}{2})$, $i \in S_0$.

Proposition 1. Let $\varphi_j \in \mathcal{P}$ and let h be a stair function of
$\varphi_j$. Then h is a recursive real-valued function.

Proof. Let $S_0$ and S be the domains of $\varphi_j$ and h respectively. We define
$Dom_S$ as the set of all $(p, \alpha) \in Q \times Q^+$ such that
$[p - \alpha, p + \alpha] \subset (l - \frac{1}{2}, l + \frac{1}{2})$ for an $l \in S_0$,
and a computable function $A_h : Dom_S \to Q \times Q^+$ by
$A_h((p, \alpha)) = (\varphi_j(m), \alpha)$, where $m \in N$,
$|p - m| < \frac{1}{2} - \alpha$. Then $Dom_S$ is a rationalized domain of S. It
remains to show that $A_h$ is a rationalized function of h. Let $x \in S$, and
(f, g) be an approximate expression of x. Now we construct the following
functions $f_0$ and $g_0$ from N to Q:

$f_0(n) = \varphi_j(K)$,

$g_0(n) = g(n)$,

where $K \in S_0$, $|x - K| < \frac{1}{2}$. Note that K can be computed uniquely,
because there is a number t such that
$[f(t) - g(t), f(t) + g(t)] \subset (K - \frac{1}{2}, K + \frac{1}{2})$. Therefore
$(f_0, g_0)$ is an approximate expression of h(x). Furthermore it holds that
$(f(n), g(n)) \in Dom_S$ implies $A_h((f(n), g(n))) = (f_0(n), g_0(n))$. Hence
$A_h$ is a rationalized function of h. □

For any given algorithm A that computes a stair function of
$\varphi_j \in \mathcal{P}$, we can easily construct a program j which receives
$n \in N$ as an input, and works as follows: If the input n is in the domain of
$\varphi_j$, then j outputs $\varphi_j(n)$, else it never stops.

program: j 
begin 

let n G N be an input; z := 0; j := 0; zz := 0; T := 0; 

while T = 0 do begin 

if A{{n, ^)) has an output in at most j steps then {q,(3) := A{{n, ^)); 
while I -\- \ < q-\- (3 do I := l-\- t] 

[q — (3 , q [3] C {I — ^,l ^ then output I and T := 1 

else j := j — 1 and z := z -I- 1; 
if J < 0 then n := rz -I- 1, j := rz and z := 0 
end 
end. 
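
The dovetailing idea behind such a program can be sketched in Python as
follows. This is only an illustration under stated assumptions: the interface
`A(p, alpha, steps)`, which returns `None` when no output is produced within
`steps` steps, the radii `1 / 2**(i + 2)`, the cut-off `max_rounds`, and the toy
algorithm `toy_A` are all hypothetical and are not taken from the paper.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def recover_value(A, n, max_rounds=50):
    """Search for phi_j(n) by dovetailing over precision i and step budget j."""
    for m in range(max_rounds):            # m indexes the anti-diagonal i + j = m
        for i in range(m + 1):
            j = m - i                      # step budget for this attempt
            out = A(Fraction(n), Fraction(1, 2 ** (i + 2)), j)
            if out is None:                # no output within j steps
                continue
            q, beta = out
            l = round(q)                   # nearest integer to q
            # Accept l only if [q - beta, q + beta] lies inside (l - 1/2, l + 1/2).
            if l - HALF < q - beta and q + beta < l + HALF:
                return l
    return None                            # a real program would keep searching


def toy_A(p, alpha, steps):
    """Hypothetical A for the stair function of phi(i) = i*i, defined on even i only."""
    m = round(p)
    if steps < 3 or m % 2 != 0 or not abs(p - m) < HALF - alpha:
        return None
    return Fraction(m * m), alpha
```

With these toy definitions, `recover_value(toy_A, 4)` finds the value 16, while
for an odd input the search is abandoned after `max_rounds` rounds, which stands
in for the non-terminating case.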

3 The Model of Learning

In our scientific activities we cannot observe the exact value of a real number x,
but we can observe approximations of x. Such approximations can be captured
by a pair $(p, \alpha)$ of rational numbers such that p is an approximate value of
the number x and $\alpha$ is its error bound, i.e., $x \in [p - \alpha, p + \alpha]$. We
call such a pair $(p, \alpha)$ a datum of x.

Definition 6. Let $S \subseteq R$, and let $h : S \to R$ be a function. A datum of a
function h is a pair $((p, \alpha), (q, \beta))$ such that there is an $x \in S$ such
that $(p, \alpha)$ and $(q, \beta)$ are data of the numbers x and h(x) respectively.

Definition 7. A presentation of the function h is an infinite sequence
$w_1, w_2, \ldots$ of data of h in which, for any number x in the domain of h and
any $\zeta > 0$, there is a $w_k = ((p_k, \alpha_k), (q_k, \beta_k))$ such that
$x \in [p_k - \alpha_k, p_k + \alpha_k]$, $h(x) \in [q_k - \beta_k, q_k + \beta_k]$, and
$\alpha_k, \beta_k < \zeta$. By $\sigma$ we denote such a presentation, and by
$\sigma[n]$ we denote the initial segment of $\sigma$ of length n.

Definition 8. An inductive inference machine (IIM) is a procedure that requests 
inputs from time to time and produces algorithms that compute recursive real- 
valued functions from time to time. These algorithms produced by the machine 
while receiving data are called conjectures. 

The notion of a datum and a presentation for a real-valued function is more
relaxed than that of a rationalized function $A_h$. We do not require that a
datum $((p, \alpha), (q, \beta))$ should satisfy
$h([p - \alpha, p + \alpha]) \subset [q - \beta, q + \beta]$. Neither does the interval
$[p - \alpha, p + \alpha]$ have to be wholly contained in the domain S. We only
require the graph of h to intersect each data box at some point.
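
The difference between the two requirements can be seen in a small Python
sketch for the particular function h(x) = x^2 on S = R. The helper names
`square_range`, `is_datum_of_square` and `encloses_square` are assumptions
introduced for this illustration only.

```python
from fractions import Fraction


def square_range(lo, hi):
    """Exact range [low, high] of x*x for x in [lo, hi]."""
    ends = (lo * lo, hi * hi)
    low = Fraction(0) if lo <= 0 <= hi else min(ends)
    return low, max(ends)


def is_datum_of_square(p, alpha, q, beta):
    """True if the graph of x*x meets the box [p-alpha, p+alpha] x [q-beta, q+beta]."""
    low, high = square_range(p - alpha, p + alpha)
    return low <= q + beta and q - beta <= high


def encloses_square(p, alpha, q, beta):
    """True if the whole image of [p-alpha, p+alpha] lies in [q-beta, q+beta]."""
    low, high = square_range(p - alpha, p + alpha)
    return q - beta <= low and high <= q + beta
```

For instance, the box around (1, 4) with radii 1 and 1/10 is a datum of the
squaring function because it contains the point (2, 4), although it does not
enclose the image of [0, 2].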

For an IIM M and a finite sequence $\sigma[n] = (w_1, w_2, \ldots, w_n)$, by
$M(\sigma[n])$ we denote the last conjecture of the IIM M after requesting data
$w_1, w_2, \ldots, w_n$ as inputs. In this paper we assume that $M(\sigma[n])$ is
defined for any n.

Definition 9. Let $\sigma$ be a presentation for some function h. For an IIM M,
$M(\sigma[n])$ converges to an algorithm $A_{h'}$, if there exists a number
$n_0 \in N$ such that $M(\sigma[m])$ equals $A_{h'}$ for any $m \ge n_0$.

A set T of recursive real-valued functions is said to be recursively enumerable
if there is a recursive function $\Psi$ such that the set T is equal to the set of
all functions computed by algorithms $\Psi(1), \Psi(2), \ldots$.

Definition 10. Let $S_0$ and S be subsets of R, $h_0$ be a function from $S_0$ to
R, and h be a function from S to R. Then $h_0$ is a restriction of h (denoted by
$h_0 = h|_{S_0}$), if $S_0 \subseteq S$ and $h_0(x) = h(x)$ for any $x \in S_0$. We
also say that h is an extension of $h_0$.

Since we do not distinguish a function from its extensions, we claim the 
success in learning even when an IIM converges to an extension of the target. A 
similar technique was used previously in 

Let $\varphi_j \in \mathcal{P}$ be a recursive function from N to N, and let h be a stair function
of ipj. We define that a = For each we can 

define a constant function hi from {i — ^,i+ to {^j{i)}- Let p® = - ■ ■ 

be a presentation of hi with = {{i — \ + such that 

k,m & N , 2^ — k < n < 2^^^ — (fc + 1) and m = n + k + l — 2^. Furthermore 
let p/j = ici, W 2 , • ■ ■ be a presentation of h with such that s,t G N, 

bs(s — 1) < n — 1 < is(s+l) and t = bs(s+l) — n. We call a stair presentation 
of h. 

4 Learning Criteria 

An IIM M succeeds in learning a function h, if the algorithm $A_{h'}$, to which M
converges, computes an extension h' of h. In this paper we introduce different
criteria of success to formalize the statement that an IIM M learns a target
function h. Such criteria are REALEX, REALCONS, REALSCONS, REALFIN
and REALNUM. Two of them, REALEX and REALCONS, are from [2].

Definition 11. Let h be a recursive real-valued function. An IIM M is said to
learn h in the limit (denoted by $h \in REALEX(M)$), if for any presentation
$\sigma$ of h, M converges to an algorithm $A_{h'}$ that computes an extension of h.

A class T of recursive real-valued functions is REALEX-inferable, if there is
an IIM M which learns every $h \in T$ in the limit. By REALEX we denote the
collection of all REALEX-inferable classes T of recursive real-valued functions.

Consistent inductive inference was first studied by Wiehagen, requiring
that any program produced by an IIM be correct on all the data seen so far.
This notion is extended to the case of real-valued functions in [2].

Definition 12. Let T be a class of recursive real-valued functions. An IIM M
is said to consistently infer $h \in T$ in the limit, if $h \in REALEX(M)$ and for
any conjecture $h_n = M(\sigma[n])$ and any $((p, \alpha), (q, \beta)) \in \sigma[n]$ such
that $[p - \alpha, p + \alpha] \subset S$, there is an $x \in [p - \alpha, p + \alpha]$ such
that $h_n(x) \in [q - 2\beta, q + 2\beta]$.

A class T is REALCONS-inferable, if there is an IIM M which consistently
infers every function $h \in T$ in the limit. By REALCONS we denote the collection
of all REALCONS-inferable classes T of recursive real-valued functions.

We can also formalize the consistency requirement in the sense of Wiehagen.

Definition 13. Let T be a class of recursive real-valued functions. An IIM M
is said to strongly consistently infer $h \in T$ in the limit, if
$h \in REALEX(M)$ and $\sigma[n]$ is a set of data of $h_n$ for any conjecture
$h_n = M(\sigma[n])$.

A class T is REALSCONS-inferable, if there is an IIM M which strongly
consistently infers every function $h \in T$ in the limit. By REALSCONS we denote
the collection of all REALSCONS-inferable classes T of recursive real-valued
functions.

We recall that an IIM is a procedure that requests inputs from time to time
and produces algorithms that compute recursive real-valued functions from
time to time. We now consider IIMs that request input data from time to time
but produce only a single algorithm that computes a recursive real-valued
function.

Definition 14. Let T be a class of recursive real-valued functions. An IIM M
is said to finitely infer T, if for any $h \in T$ and any presentation $\sigma$ of h,
the IIM M, presented with the data of $\sigma$, outputs a unique algorithm that
computes an extension of h after some finite time.

A class T is REALFIN-inferable, if there is an IIM M which finitely infers
every function $h \in T$. By REALFIN we denote the collection of all
REALFIN-inferable classes T of recursive real-valued functions. By REALNUM, we
denote the collection of all recursively enumerable sets of recursive
real-valued functions.

5 A Comparison of Identification Criteria 

In this section we compare identification criteria for inductive inference of
recursive real-valued functions. It is obvious that
$REALSCONS \subseteq REALCONS \subseteq REALEX$.

Theorem 2. $REALCONS \subsetneq REALEX$.

Proof. We show that $REALEX \setminus REALCONS \ne \emptyset$. Let U be the set of
all recursive functions h from N to N such that there exists a number $l \in N$
such that h(l - 1) = 0, h(l) = j, h(n) > 0 for any n > l, and $\varphi_j = h$ for a
number $j \in N$, and let T be the set of all stair functions of functions in U.
Then $T \in REALEX$. Now we show that $T \notin REALCONS$. Assume that there
is an IIM M that REALCONS-infers T. Let $\sigma = w_1, w_2, \ldots$ be a
presentation and w be a datum. By $M(\sigma[n], w)$ we denote the last conjecture
of the IIM M after requesting data $w_1, w_2, \ldots, w_n, w$ as inputs. For each
$i \in N$, we define a stair function $\eta_i$ which satisfies the following
conditions:

??*( 0 ) = 0 , 

= L 

r 1 if M{pJ^^[^k{k -G 1)]) yf M{pr,i[kk{k -G l)],di), 

r]i{k-G 1) = < 2 if 1)]) = -h l)],di) 

[ and M{pn,[^k{k -G 1)]) yf -h l)],d 2 ), 

where k G N~^,di = {{k-Gl, |), (1, |))), ^2 = ((^+1, |), (2, |))), and is a stair 
presentation of rji. For each i G N , rji is a recursive real- valued function, because 
M{p^^[\k{k + l)],((fc-h 1, i),(l,|))) yf M{pnA\k{k + 1)],((^ + 1,5)>(2>|))) 
for any $k \in N$. By the recursion theorem, there is a number $t \in N$ such that
$\varphi_t = \eta_t$. Thus it holds that $\varphi_t \in T$ and the sequence
$\{M(\rho_{\eta_t}[n])\}_{n \in N}$ does not converge, which is a contradiction.
Hence $T \notin REALCONS$. □

Theorem 3. $REALSCONS \subsetneq REALCONS$.

Proof. Recall that $\mathcal{P} = \{\varphi_0, \varphi_1, \varphi_2, \ldots\}$ is the
set of all partial recursive functions from N to N. Consider the set of all
primitive recursive functions from N to {0, 1}, and let z be the function
defined by z(n) = 0 for any $n \in N$. There is a recursive function a from N to
N such that this set is equal to
$\{\varphi_{a(0)}, \varphi_{a(1)}, \varphi_{a(2)}, \ldots\}$.

Note that the set $\{i \in N \mid \varphi_{a(i)} = z\}$ is not recursively
enumerable.

It is trivial that $REALSCONS \subseteq REALCONS$. Therefore it suffices to show
that $REALCONS \setminus REALSCONS \ne \emptyset$. For any $i \in N$, we define a function
hi from [0, 1] to R by hi{x) = Si’^ce ^ recursive 

real number for any i G N, hi is a, recursive real-valued function. Let T = 
{/iQ, hi, h 2 , ■ ■ •}. Then T G REALCONS . Assume that there is an IIM M. which 
REALSCONS-inievs T. For any i G N, we define a presentation at = w\,W 2 , - ■ ■ 
of hi with wh = ((p, ^), li)) S'^ch that k,l,m G N , 2^ - k < I < 

2^+^ — k—\ and m = n—2^—n+l. By Ad((r[n], w) we denote the last conjecture of 
the IIM M requested data wi,W 2 , - ■ ■ ^ Wn, w as inputs. Since M REALSCONS- 
infers T, hi{x) = 0 for any x G [0,1] iff there is a number I G N such that 
Ej=o ^ - It < 0 < ^ + If any n<l and M{ai[l]) = M{ai[l],d), 

where d= ((i, i), {S^, ^)). Thus {i G iV | hi{x) = 0 for any [0, 1]} is recursively 
enumerable. It holds that $h_i(x) = 0$ for any $x \in [0, 1]$ iff
$\varphi_{a(i)} = z$, for any $i \in N$. Therefore
$\{i \in N \mid \varphi_{a(i)} = z\}$ is recursively enumerable, which is a
contradiction.

Hence $T \notin REALSCONS$. □



Theorem 4. $REALFIN \subsetneq REALSCONS$.

Proof. It is obvious that $REALFIN \subseteq REALSCONS$. Therefore we just show
that $REALSCONS \setminus REALFIN \ne \emptyset$. Let T be the set of all constant
functions from [0, 1] to Q. Then $T \in REALSCONS$. Assume that there is an IIM
M which REALFIN-infers T. Let $c_r \in T$ be a target function defined by
$c_r(x) = r$, and $w_1, w_2, \ldots$ be a presentation of $c_r$ such that
$q_n - \beta_n < r < q_n + \beta_n$, where $w_n = ((p_n, \alpha_n), (q_n, \beta_n))$ for
any $n \in N$. Since M REALFIN-infers T, M, after requesting
$w_1, w_2, \ldots, w_k$, outputs an algorithm which computes $c_r$ for a $k \in N$.
Put $l := \max\{q_1 - \beta_1, \ldots, q_k - \beta_k\}$ and
$u := \min\{q_1 + \beta_1, \ldots, q_k + \beta_k\}$. Then l < r < u. Since there
exists a $j \in Q$ such that l < j < u and $j \ne r$, $w_1, \ldots, w_k$ are data of
a constant function $c_j$. Let $w'_1, w'_2, \ldots$ be a presentation of $c_j$ such
that $w_1 = w'_1, \ldots, w_k = w'_k$. Then M, after requesting
$w'_1, \ldots, w'_k$, outputs an algorithm which computes $c_r$, which is a
contradiction. Hence $T \notin REALFIN$. Consequently
$REALFIN \subsetneq REALSCONS$. □
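
The key step of this proof, finding a second constant that fits the same finite
set of data, can be traced in a short Python sketch. The function names
`alternative_constant` and `fits`, and the choice of the midpoint between l and
r, are assumptions made for this illustration.

```python
from fractions import Fraction


def alternative_constant(data, r):
    """Given finitely many data of c_r, return a rational j != r with l < j < u."""
    l = max(q - beta for (_, (q, beta)) in data)
    u = min(q + beta for (_, (q, beta)) in data)
    if not l < r < u:
        raise ValueError("the data do not strictly enclose r")
    return (l + r) / 2          # rational, and l < j < r < u


def fits(data, c):
    """True if every value interval [q - beta, q + beta] strictly contains c."""
    return all(q - beta < c < q + beta for (_, (q, beta)) in data)
```

Any finite initial segment of a presentation of $c_r$ is therefore also an
initial segment of a presentation of some $c_j$ with $j \ne r$, so a finite
learner that commits after k data cannot be right on both.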

Now we recall a result on identification criteria for inductive inference of
recursive functions from N to N. Let $\mathcal{R}$ be the set of all recursive
functions from N to N. Every recursively enumerable subset of $\mathcal{R}$ is
consistently inferable in the limit, whereas the following theorem asserts that
there exists a recursively enumerable class of recursive real-valued functions
which is not REALEX-inferable.



Theorem 5. $REALNUM \setminus REALEX \ne \emptyset$.

Proof. Recall that $\mathcal{P} = \{\varphi_0, \varphi_1, \varphi_2, \ldots\}$ is the
set of all partial recursive functions from N to N. Let S be the set of all
stair functions of functions in $\mathcal{P}$. It is obvious that
$S \in REALNUM$. Assume that there is an IIM M which REALEX-infers S.

Let $\varphi_j$ be a target function in $\mathcal{P}$, and h be the stair function
of $\varphi_j$. We define
$Image(\varphi_j) = \{(n, \varphi_j(n)) \mid n \text{ is in the domain of } \varphi_j\}$,
and we call a sequence
$\sigma = (x_0, \varphi_j(x_0)), (x_1, \varphi_j(x_1)), \ldots$ data of $\varphi_j$ if
$\{(x_0, \varphi_j(x_0)), (x_1, \varphi_j(x_1)), \ldots\} = Image(\varphi_j)$. For
each $(x_i, \varphi_j(x_i))$ with $x_i$ in the domain of $\varphi_j$, we can define a constant
function hi from {xi — \,Xi+ to {pj{xi)}. Let p* = wl,W 2 ,- • • be an infinite 
sequence of data of hi such that = ((i — | + p, P), {pj{xi), 2 bW))> where 
k and m are natural numbers such that 2^ — k < n < — (/c + 1) and 

TO = n + /c+l — 2^. Furthermore let = Wi, W 2 , • • • be an infinite sequence of 
data of h such that where s and t are natural numbers such that 

|s(s — 1) < n — 1 < ^s(s + 1) and t = |s(s + 1) — n. Since p* is a presentation 
of hi for any i G N , ph is a, presentation of h. Note that we can construct ph for 
any given a. 

Since the IIM M REALEX-infers S, $M(\rho_h)$ converges to an algorithm $A_h$
which computes an extension of h. Since h is a stair function of
$\varphi_j \in \mathcal{P}$, we can construct a program j which receives $n \in N$
as an input, and works as follows: If the input n is in the domain of
$\varphi_j$, then j outputs $\varphi_j(n)$, else it never stops. Thus there exists
an IIM $M_0$ that infers every $\varphi_j \in \mathcal{P}$ in the limit, for any
input data $\sigma$ of $\varphi_j$, which is a contradiction. Hence
$REALNUM \setminus REALEX \ne \emptyset$. □



Example 1. For each i G N, we define recursive real- valued functions 

by 



hi{x) 



1 if a; < 0, 

0 if X > p , 



hi{x) 



1 if X < 0, 

0 if X > p. 



hi and hi 



We also define a recursive real- valued function h by 



h{x) 



1 if X < 0, 
0 if X > 0. 



Let r = {h, /lo, hi,h 2 , ■ ■ •}. Then T G REALNUM \ REALEX. 

Now let (7i = w\,W 2 , wl, • • • be a presentation of hi with wlj_^={{0, ^)) 

such that j G N~^. Then hi is also a presentation of hi for each i G N. Therefore 
we can construct a presentation (jj = w\,W2,wl, ■ ■ ■ of hi for each i G N as 
follows: 

{ Wj if z = 0, 

if z > 0 and 1 < j < zzi_i, 
wp+fc if z > 0 and j = m-i + 2k, 
w\. if z > 0 and j = rzi_i + 2k — 1, 



where k G N~^, and rii is the least natural number such that A4(ai[n]) = 
A4(ai[ni]) for any n > rii. Let ct* = linii^oo c^i. Since ai is also a presenta- 
tion of hi for each z € iV, cr* is a presentation of h. By the definition of <7*, the 
progression {M(cr*[zz])}„gAr does not converge. Thus T ^ REALEX. □ 

Theorem 6. $REALFIN \cap REALNUM \ne \emptyset$.

Proof. Let T be the set of all constant functions from [0, 1] to N. Since T is
recursively enumerable, $T \in REALNUM$. For any constant functions
$c_i, c_j \in T$, it holds that $|i - j| \ge 1$ iff $i \ne j$. Thus
$T \in REALFIN$. Consequently $REALFIN \cap REALNUM \ne \emptyset$. □
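
The reason this class is finitely inferable can be sketched in Python: as soon
as a datum with value radius below 1/2 arrives, its value interval contains at
most one integer, so the learner can commit. The name `finite_learner` and the
threshold test are assumptions of this sketch, which returns the constant
itself rather than an algorithm computing it.

```python
import math
from fractions import Fraction


def finite_learner(presentation):
    """Return the integer constant identified by the first sufficiently precise datum."""
    for (_, (q, beta)) in presentation:
        if beta < Fraction(1, 2):
            # [q - beta, q + beta] has length below 1: at most one integer inside.
            k = math.ceil(q - beta)
            if k <= q + beta:
                return k
    return None   # no datum was precise enough in the finite prefix supplied
```

Since distinct constants in T differ by at least 1, the single conjecture
produced this way cannot be consistent with the data of any other function in
the class.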

Theorem 7. $REALFIN \setminus REALNUM \ne \emptyset$.

Proof. Let U be the set of all recursive functions h from N to N such that
$\varphi_{h(0)} = h$, and let T be the set of all stair functions of functions in
U. Then $T \in REALFIN$. Since T is not recursively enumerable, we have that
$T \notin REALNUM$. Consequently $REALFIN \setminus REALNUM \ne \emptyset$. □

6 Learning Functions on a Fixed Rational Interval 

In the previous section we have considered several criteria for learning
recursive real-valued functions. In each of the criteria we have not cared about
the domains of the functions to be learnt. In this section we discuss
inferability of recursive real-valued functions on a fixed domain. Hence we can
assume that the IIM knows the domain of functions to be learnt. We focus our
attention on functions on a rational interval.

Let I be a rational interval, and T be a class of recursive real-valued
functions on I. In order to emphasize the interval I, we say that a class T is
$REALEX_I$-inferable, if $T \in REALEX$, and by $REALEX_I$ we denote the
collection of all $REALEX_I$-inferable classes T of recursive real-valued
functions. Similarly we define $REALCONS_I$, $REALSCONS_I$, $REALFIN_I$ and
$REALNUM_I$ to emphasize the interval I. Then we have the same results as in the
previous section:

Theorem 8. Let I be a rational interval. Then

(1) $REALCONS_I \subseteq REALEX_I$,

(2) $REALSCONS_I \subsetneq REALCONS_I$,

(3) $REALFIN_I \subsetneq REALSCONS_I$,

(4) $REALFIN_I \setminus REALNUM_I \ne \emptyset$.

Proof. (1) is obvious. (2) We recall T in the proof of Theorem 3. Every
$h_i \in T$ is a constant function on the same closed interval [0, 1]. For any
$h_i \in T$, we construct a function $\hat{h}_i$ from I to R by
$\hat{h}_i(x) = h_i(0)$. Let $T_0 = \{\hat{h}_i \mid h_i \in T\}$. Then
$T_0 \in REALCONS_I \setminus REALSCONS_I$. Hence
$REALSCONS_I \subsetneq REALCONS_I$. (3) Let T be the set of all constant
functions from I to Q. In the same way as in the proof of Theorem 4 it holds
that $T \in REALSCONS_I \setminus REALFIN_I$. Hence
$REALFIN_I \subsetneq REALSCONS_I$. (4) Let $U \subset N$ be a set that is not
recursively enumerable, and let T be the set of all constant functions from I
to U. Then $T \notin REALNUM_I$ and $T \in REALFIN_I$. □

Proposition 2. Let I = [0, 1). Then $REALCONS_I \subsetneq REALEX_I$.

Proof. It is obvious that $REALCONS_I \subseteq REALEX_I$. We show that
$REALEX_I \setminus REALCONS_I \ne \emptyset$. Let T be the set of all recursive
real-valued functions $\eta_j$ from
[0, 1) to R defined by rjj{x) = 2®+^(l — — x)h{i) + 2*+^(a; — 1 -I- ^)h{i + 1) 

if a; S [1 — ^,1— 2 i^)) where ft. is a recursive function from N to N such that 
there exists a number I G N such that h{l — 1) = 0, h{l) = j and h{n) > 0 
for any n > I, ipj = h tor a number j G N. Then T G REALEXi. Assume 
that there is an IIM Ai that RLALCONS i-inieis T. Let rj G T he a target 
function. Since r]\[i_x i ) is a recursive real-valued function for each i G N, 

there exists a presentation Gi = wCwh,-'' of i such that w\ = 

((1 — A, (ft(l “ ii): |))- Lot = Wi,W 2 , • • • be a presentation of rj such 

that Wn = where s,t G N that is(s — 1) < n — 1 < |s(s -I- 1) and 

t = ^s(s -I- 1) — n. By M{a[n],w) we denote the last guess of the IIM M 
requested data wi,W 2 , - ' ' i Wm w as inputs. For each i G N, we define a function 
f)i gT satisfies the following conditions: 

fti(O) = 0, 

fti(l) = h 

r 1 if Mo(crj)j|fc(fc-H 1)]) ^ Mo(cr^Jifc(fc-h l)],di), 

f]i{k -I- 1) = ^ 2 if Mo{afn -|- 1)]) = -I- 1)], di) 

[ and Mo{afi,[^k{k + 1)]) Mo(cr^Jift(ft -h l)],d 2 ), 

where k G N+,di = {{k + 1, t), (1, j))), and ^2 = {{k + 1, |), (2, j))). For each 

i G N, fji is a recursive real valued function, because of Aio{afi.[^k{k+l)], {{k + 
i))) 7^ Mo(afi,[^k{k + l)],((fc -h 1, |),(2, i))) for any k G N. By the 
recursion theorem, there is a number a G N such that ipa = fja- Thus it holds 
that fja & T and the progression {Mo(crjj^ [n])}„giv does not converge, which is 
contradiction. Hence T ^ RLALCONS . □ 

The following theorem asserts that every recursively enumerable class of
recursive real-valued functions is $REALCONS_I$-inferable, which is an
interesting contrast to the result in Theorem 5.

Theorem 9. $REALNUM_I \subsetneq REALCONS_I$ for every rational interval I.

Proof. Let T be a recursively enumerable set of recursive real-valued functions
on a fixed closed or open rational interval I. According to [219],
$T \in REALCONS_I$. Similarly we have $T \in REALCONS_I$, even if I is a
half-open rational interval. Thus $REALNUM_I \subseteq REALCONS_I$ for every
rational interval I. By Theorem 8, $REALNUM_I \subsetneq REALCONS_I$ for every
rational interval I. □

Theorem 10. $REALNUM_I \setminus REALSCONS_I \ne \emptyset$ for every rational
interval I.

Proof. We recall T in the proof of Theorem 3. Every $h \in T$ is a constant
function from [0, 1] to R. For any $h \in T$, we construct a function $h_0$ from
I to R by $h_0(x) = h(0)$. Let $T_0 = \{h_0 \mid h \in T\}$. Then
$T_0 \in REALNUM_I$ and $T_0 \notin REALSCONS_I$. Hence
$REALNUM_I \setminus REALSCONS_I \ne \emptyset$ for every rational interval I. □

If the IIM knows that the domain of the target function is a closed rational
interval, REALEX-inferability coincides with REALCONS-inferability, which is
again an interesting contrast to the result in Proposition 2.

Theorem 11. $REALCONS_I = REALEX_I$ for every closed rational interval I.

Proof. Let $I = [p - \alpha, p + \alpha]$ be a closed rational interval, where
$(p, \alpha) \in Q \times Q^+$. By Theorem 8, $REALCONS_I \subseteq REALEX_I$. We
show that $REALEX_I \subseteq REALCONS_I$. Let $T \in REALEX_I$. Then there is an
IIM M which $REALEX_I$-infers T. For any $h \in T$ and any presentation
$\sigma = w_1, w_2, \ldots$ of h, there is an $l \in N$ such that
$M(\sigma[n]) = M(\sigma[l])$ for any $n \ge l$, and $M(\sigma[l])$ is an algorithm
which computes h. Let $h_n$ be the function the algorithm $M(\sigma[n])$ computes
for each $n \in N$, and let $A_{h_n} = M(\sigma[n])$. Then there exists a
$k \in N$ such that $h_n$ is defined on I for any $n \ge k$, that is,
$I \subset \bigcup_{(a, \gamma) \in D^n_\delta} [a - \gamma, a + \gamma]$ for any
$n \ge k$, where $D^n_\delta$ is the n-th division set of I w.r.t. $A_{h_n}$ and
$\delta > 0$ defined as follows:

A” = {a : a = p + for some integer fc, — 2 " <k< 2 "}, 

= {( 0 , 7 ) GQxQ+ : o G A ",7 = ^, and 

Ah„{{a,y)) halts in at most n steps}, 

E>s = {(a, 7 ) e Ur=o ((“> t)) = {b, /3) and /3 < |j. 

Therefore we can determine whether there is an x such that
$h_n(x) \in [q - 2\beta, q + 2\beta]$, for each datum
$((p, \alpha), (q, \beta)) \in \sigma[n]$ with $[p - \alpha, p + \alpha] \subset I$. Thus
we can construct an IIM $M_0$ which REALCONS-infers T as follows:

IIM: M_0
begin
    n := 1; D := ∅; δ := 1;
    repeat
        read w_n and D := D ∪ {w_n};
        let D^n_δ be the n-th division set of I w.r.t. M(σ[n]) and δ;
        if I ⊂ ∪_{(a, γ) ∈ D^n_δ} [a - γ, a + γ] and
            there exists an x such that h_n(x) ∈ [q - 2β, q + 2β],
            for each datum ((p, α), (q, β)) ∈ D with [p - α, p + α] ⊂ I
        then output M(σ[n]) else output linear(D);
        n := n + 1
    forever
end.



where linear(D) is an algorithm which computes a function h such that D is
a finite set of data of h. Hence $REALEX_I \subseteq REALCONS_I$. Consequently,
for every closed rational interval I, $REALCONS_I = REALEX_I$. □

7 Conclusions

In this paper we have considered learning recursive real-valued functions from
data. We have shown that $REALCONS \subsetneq REALEX$ and
$REALNUM \setminus REALEX \ne \emptyset$. We have also discussed the relationship
between different criteria based on whether the inference machine knows the
domains of functions to be learnt. More exactly we have shown that if the
functions are defined on a fixed rational open or half-open interval, it holds
that $REALCONS \subsetneq REALEX$ and $REALNUM \subsetneq REALCONS$, and if they
are defined on a fixed rational closed interval, it holds that
REALCONS = REALEX.

Abstract. Concept drift means that the concept about which data is
obtained may shift from time to time, each time after some minimum
permanence. Except for this minimum permanence, the concept shifts
need not satisfy any further requirements and may occur infinitely
often. This work studies to what extent it is still possible to predict
or learn values for a data sequence produced by drifting concepts.
Various ways to measure the quality of such predictions, including
martingale betting strategies and density and frequency of correctness,
are introduced and compared with one another. For each of these measures
of prediction quality, for some interesting concrete classes, (nearly)
optimal bounds on permanence for attaining learnability are established.
The concrete classes, from which the drifting concepts are selected,
include regular languages accepted by finite automata of bounded size,
polynomials of bounded degree, and exponentially growing sequences
defined by recurrence relations of bounded size. Some important,
restricted cases of drifts are also studied, e.g., the case where the
intervals of permanence are computable. In the case where the concepts
shift only among finitely many possibilities from certain infinite,
arguably practical classes, the learning algorithms can be considerably
improved.



1 Introduction 



In many machine learning situations, the concepts to be learned or the
concepts auxiliarily useful to learn may drift with time. As in earlier work on
drift, to sufficiently track drifting concepts to permit learning something of
them at all, it is necessary to consider some restrictions on the nature of the
drift. For example, Helmbold and Long bound the probability of disagreement
between subsequent concepts. Blum and Chalasani place some constraints on how
many different concepts may be used, or the frequency of concept switches.
Bartlett, Ben-David and Kulkarni consider a ‘class of legal function sequences’
based on some constraints (such as being formed from a walk on a directed
graph).

Most of the above work was in a setting similar to PAC learning. Our work 
addresses drift in a more general computability setting. In the present paper we 
consider some pleasantly modest restrictions on the rate with which one concept 
changes into another, model concepts as functions and employ as our principal 
learning vehicle (computable) martingale betting strategies.

N denotes the set of natural numbers {0, 1, 2, ...}. Functions (as concepts)
considered in this paper have domain N or, in some special cases, the set of
binary strings {0, 1}* which is identified with N in a standard way. The range of
the functions is normally N, but it is sometimes {0, 1} (in the case of computable
languages represented as characteristic functions)^1 or the set of integers Z or
rationals Q (in the cases of some concrete examples).

It is not possible to predict the next values of a rapidly shifting concept if, 
in each time step, the concept changes without restriction. For example, a drift 
which randomly vacillates between the constantly 0 function and the constantly 1 
function can produce as a data sequence any {0, 1}-valued function, and, hence,
the class of such data sequences cannot be usefully predicted. 

Therefore, given a class S of functions, the learning tasks we consider
involve data sequences for segments of members of S where these segments do
not change from one member of S to another too often. We require that any
concept/function from such an S in a drifting data sequence be present for some
minimal number of successive data points. We call a function p computing this
minimal number the permanence. The class of data sequences with segments
from members of S with each segment required to be present with permanence
p is called S[p]. The formal definition follows immediately.

Definition 1. Let S be a class of computable functions. A function f is said
to be obtained from S by concept drift with permanence p if and only if, for
each x, there is an interval $I_x$ containing x and a function $g_x \in S$ such
that $|I_x| \ge p(\min(I_x))$ and $f(y) = g_x(y)$, for all $y \in I_x$. S[p]
denotes the class of all such functions f.
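
A finite prefix of such a drifting function can be assembled from consecutive
segments, as in the following Python sketch. The helper `drift_sequence` and its
segment representation (a function together with a segment length) are
assumptions made for illustration; the check mirrors the requirement that each
segment be at least as long as the permanence evaluated at its starting point.

```python
def drift_sequence(segments, p):
    """Concatenate segments (g, length) into the values f(0), f(1), ... of a drift.

    Raises ValueError if a segment is shorter than the permanence at its start.
    """
    values, start = [], 0
    for g, length in segments:
        if length < p(start):
            raise ValueError("segment starting at %d is too short" % start)
        values.extend(g(y) for y in range(start, start + length))
        start += length
    return values


def const0(y):
    return 0


def const1(y):
    return 1
```

With the constant permanence p(x) = 3, a segment of three 0s followed by four
1s is admissible, whereas it is rejected for p(x) = 4.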

We only consider permanences p such that p is a non-decreasing,
{1, 2, 3, ...}-valued function. We always assume this restriction on p without
explicitly saying so.

Learning normally deals not with a single concept but with a class of
concepts. Therefore it is necessary to define when a class of objects is
learnable under a given criterion. As we see in the immediately following
definition, learnability of a class is defined in terms of learnability of the
single objects in it.

Definition 2. A class S of functions can be learned under a given criterion
with permanence p if and only if there is a computable and total machine M
which succeeds on every function $f \in S[p]$ under the given criterion.^2

^1 We sometimes call {0, 1}-valued functions binary functions.

^2 Here and below “total” means that the machine always has defined output.

So it suffices, then, to define various criteria under which a learner M is said
to succeed on a single function f. Shortly below we define three such criteria of
success.

Learning is normally modeled as a process to identify an underlying global
concept which describes the observed behavior. Under concept drift, this
underlying global description does not exist or is too complicated. Therefore,
the learner can be expected to give local descriptions only. Within this paper,
the local behavior is mostly described by just guessing the next value(s).
Because of the unpredictable drifts of the concept, it is unavoidable to err
infinitely often. So the learning criteria considered, in effect, involve the
ratios of successes and failures during the learning process. The learners
studied in the sequel are always total and computable devices which give
predictions for the values f(x + 1) from the data f(0), f(1), ..., f(x). The
criteria of correctness for such devices differ in how the quantities of correct
and incorrect predictions are measured and compared. The next three definitions
introduce learning criteria each of which quantifies the amount of correct
prediction which is required of a successful learner M operating on a function
f (normally in S[p]).

Definition 3. A learner M learns a function f (or predicts f) with frequency 
a out of b if and only if, for each x, at least a of the equations 

f(y+1) = M(f(0)f(1)...f(y)) 

are correct, where y ranges over the b arguments x, x+1, ..., x+b-1. We refer 
to such learners as frequency learners. 

We say that a class is frequency learnable if and only if some learner predicts 
all functions in the class with frequency a out of b, for some a, b with 1 ≤ a ≤ b. 
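
Definition 3 quantifies over all x, so it can only be falsified, never confirmed, on finite data. The following minimal Python sketch checks the condition on a finite prefix of f. The names `predicts_with_frequency` and `repeat_last` are illustrative assumptions and do not come from the paper.

```python
def predicts_with_frequency(learner, f_values, a, b):
    """Check the condition of Definition 3 on the finite prefix f_values.

    learner maps a non-empty list f(0..y) to a guess for f(y+1).
    Only windows y = x, ..., x+b-1 that fit inside the prefix are tested.
    """
    n = len(f_values)
    # correct[y] is True iff M(f(0)...f(y)) == f(y+1)
    correct = [learner(f_values[: y + 1]) == f_values[y + 1] for y in range(n - 1)]
    return all(sum(correct[x : x + b]) >= a for x in range(len(correct) - b + 1))


def repeat_last(prefix):
    """Hypothetical toy learner: predict that the last value repeats."""
    return prefix[-1]


# A function that changes value every second step: 0, 0, 1, 1, 0, 0, ...
f = [0, 0, 1, 1] * 4
print(predicts_with_frequency(repeat_last, f, 1, 2))  # one hit in every window of 2
print(predicts_with_frequency(repeat_last, f, 2, 2))  # fails: every second guess is wrong
```

On this example the toy learner is right exactly at every other position, which is why it meets frequency 1 out of 2 but not 2 out of 2.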

The requirement that, for each interval of length b, at least a of the predictions 
are correct is quite restrictive. This could be alleviated somewhat by aiming for a 
particular ratio between a and b in a limiting sense instead of requiring it for each 
interval. In other words, the set X of all correct predictions need only be of some 
minimum “density.” We employ a notion of density introduced by Tennenbaum 
[?, §9.5/9-38] in formalizing this approach to frequency learners. Tennenbaum 
called the limit inferior^ of the sequence (1/(x+1)) · (A(0) + A(1) + ... + A(x)) the 
density of the set A.^ Royer [?] introduced the related notion of uniform density 
of a set A to be the limit inferior of the sequence 
min{(1/(x+1)) · (A(y) + A(y+1) + ... + A(y+x)) : y ∈ N}. 
These notions are incorporated in the next definition. 



^ Since we deal almost always with “learning by prediction” we often just write “M 
learns f” as a shorthand notation for “M learns f by predicting values of f” and 
so on. 

^ The definition of the limit inferior can be found in most advanced calculus textbooks, 
for example, [?]. The limit inferior of a sequence a_0, a_1, ... is the supremum r of all 
rational numbers q which are below almost all a_n: r = sup{q : (∀^∞ n) [q < a_n]}. 

^ For A ⊆ N, A(x) = 1 if x ∈ A and A(x) = 0 if x ∉ A. 
Definition 4. A learner M learns a function f (or predicts f) with (uniform) 
density q if and only if the (uniform) density of the set {x : M(f(0)f(1)...f(x)) 
= f(x+1)} is at least q. We refer to such learners as (uniform) density learners. 
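
The difference between density and uniform density can be seen on finite data. The sketch below computes the two ratios whose limit inferior defines the densities. It is only a finite approximation, and the set used (long runs of members followed by slowly growing gaps) is chosen for illustration; all function names are assumptions.

```python
from fractions import Fraction


def in_X(x):
    """Illustrative set: x is a member iff 2**y <= x < 2**(y+1) - y for some y."""
    if x == 0:
        return False
    y = x.bit_length() - 1  # the unique y with 2**y <= x < 2**(y+1)
    return x < 2 ** (y + 1) - y


def prefix_density(A, x):
    """(A(0) + ... + A(x)) / (x + 1), the ratio behind the density."""
    return Fraction(sum(A[: x + 1]), x + 1)


def window_min_density(A, x):
    """Minimum over y of (A(y) + ... + A(y+x)) / (x + 1), restricted to
    windows that fit into the finite list A (the ratio behind uniform density)."""
    return min(Fraction(sum(A[y : y + x + 1]), x + 1) for y in range(len(A) - x))


A = [int(in_X(x)) for x in range(1024)]
print(prefix_density(A, 1023))    # close to 1: gaps are rare overall
print(window_min_density(A, 8))   # 0: some window of length 9 lies inside a gap
```

Because the gaps keep growing, every window length eventually fits inside a gap, while the share of gap positions in a long prefix tends to zero.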

It may be argued that in the criteria introduced so far, the learner is unnecessarily 
penalized by being required to make a prediction at all times. The learner is not 
allowed to use any knowledge about the times when predictions are easy and 
when they are difficult. The learner may be bogged down by difficult predictions 
even if it has some restricted knowledge which is enough to correctly predict the 
majority of values. A well-known setting that models such a case is the world 
of gambling [?]. Here a gambler may decide whether and how much to bet on a 
certain prediction coming true, or whether to pass if it is too difficult to make a 
prediction with a reasonable chance of success. Such a gambling learner is said 
to succeed if and only if it can extract enough information about the values 
of f so that successive betting (predicting) allows it to accumulate arbitrarily 
large amounts of money. The following definition introduces this criterion via 
martingales. 

Definition 5. A martingale is a computable function m from strings to positive 
rational numbers such that, for every σ, there is an a and a q which satisfy 

- 0 ≤ q < m(σ); 

- m(σa) = m(σ) + q and m(σb) = m(σ) - q, for b ≠ a. 

The martingale m learns a function f (or succeeds on f or wins on f) if and 
only if the function x ↦ m(f(0)f(1)...f(x)) is not bounded by any constant. 

Intuitively, the martingale calculates the accumulated wealth of a player who, 
for every sequence or string σ, bets an amount of money q that (a number) a 
will follow σ and receives it in the case of success and loses it otherwise. This 
definition includes the ability to pass by betting 0 and also the ability to bet 
arbitrarily small amounts of money. That is, there is no smallest unit like a “Cent” 
which cannot be split into smaller pieces. On the other hand the player cannot 
(in our definition) go broke by staking at some time his total accumulation at 
that time. This latter constraint is for expository convenience in the present 
paper (we avoid having to test for going broke), and our results hold with or 
without it. 
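
The betting reading of Definition 5 can be made concrete with exact rational arithmetic. In the sketch below a strategy returns the value a it bets on and the stake q, and the wealth is updated as in the definition. The function names, the starting capital of 1, and the two sample strategies are assumptions made for illustration.

```python
from fractions import Fraction


def wealth_along(strategy, f_values, start=Fraction(1)):
    """Return the wealth m after each value of f, starting from capital start.

    strategy(prefix, capital) returns (a, q): bet q that the next value is a.
    Definition 5 requires 0 <= q < capital, so the player never goes broke.
    """
    m = start
    history = [m]
    for x, actual in enumerate(f_values):
        a, q = strategy(f_values[:x], m)
        if not 0 <= q < m:
            raise ValueError("stake must satisfy 0 <= q < current wealth")
        m = m + q if a == actual else m - q
        history.append(m)
    return history


def bet_half_on_zero(prefix, capital):
    """Hypothetical strategy: always stake half the capital on the value 0."""
    return 0, capital / 2


def always_pass(prefix, capital):
    """Passing is betting 0."""
    return 0, Fraction(0)


print(wealth_along(bet_half_on_zero, [0] * 5)[-1])  # grows by a factor 3/2 per step
print(wealth_along(bet_half_on_zero, [1] * 3)[-1])  # shrinks but stays positive
```

On the constant-zero function the first strategy is unbounded, so it succeeds in the sense of Definition 5; on the constant-one function its wealth tends to 0 without ever reaching it.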

A martingale wins iff, according to the previous example, the gambler has 
arbitrarily large amounts of money at some suitable time. This analogy becomes 
more striking by the fact that the definition of martingale learning is invariant 
under the following change of definition. 

A martingale m learns f iff the limit inferior of m(f(0)f(1)...f(x)) is ∞, 

that is, iff, for all c, for all but finitely many x, m(f(0)f(1)...f(x)) > c. 

This is interesting since, when successful, the money of the gambler exceeds any 
given bound c almost always and not only infinitely often. 






J. Case et al. 



Any of the above criteria requires that the learner correctly predicts 
infinitely often on functions to be learned. One might say that this is an essential 
precondition for any kind of learning process. Hence we call a learning criterion 
reasonable if it explicitly, as above, or at least implicitly requires that the learner 
M predicts each function to be learned infinitely often correctly. The class of 
all binary functions is not learnable with respect to a reasonable criterion: if M 
is a learner then one constructs a binary function f inductively by f(0) = 0; 
f(x+1) = 1, if M(f(0)f(1)...f(x)) = 0, and f(x+1) = 0, otherwise. This 
function f disagrees with every prediction of M. So any criterion which allows 
learning the class of all the binary functions is not reasonable. Frequency learning, 
martingale learning, and learning with a density q > 0 are reasonable criteria; 
learning with density 0 is not reasonable since the requirement for success is void. 
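
The diagonal construction above is short enough to run. The sketch below builds the first n values of the binary function f that disagrees with every prediction of a given total learner. The function name and the sample learner are assumptions for illustration.

```python
def diagonal_function(learner, n):
    """First n values of f with f(0) = 0 and f(x+1) = 1 iff learner(f(0..x)) == 0."""
    f = [0]
    while len(f) < n:
        f.append(1 if learner(f) == 0 else 0)
    return f


def repeat_last(prefix):
    """Hypothetical toy learner: predict that the last value repeats."""
    return prefix[-1]


f = diagonal_function(repeat_last, 10)
print(f)  # alternates, so 'repeat_last' is wrong at every step
# Every prediction M(f(0)...f(y)) differs from f(y+1):
print(all(repeat_last(f[: y + 1]) != f[y + 1] for y in range(len(f) - 1)))
```

The same construction works for any total learner, which is the reason the class of all binary functions cannot be learned under a reasonable criterion.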

In the sequel we proceed as follows. 

In Section 2 we compare the relative predictive ability of martingale learners, 
frequency learners and density learners. We show that frequency learners are the 
most restrictive, while martingale learners and density learners with low density 
(below 1/2) are incomparable generalizations of them. 

In Section 3 we analyze the learnability of several interesting concrete 
concept classes under the various criteria introduced in the present section. Our 
upper bounds on permanence are also shown to be (nearly) optimal. 

We show that, for all h ∈ N - {0}, if constant permanence p satisfies 
p > (3h + 3) log(h + 3) then S[p] is frequency learnable, where S is the class 
of the regular languages over the alphabet {0, 1} accepted by finite automata 
with up to h states. 

While polynomials of bounded degree are shown to be learnable with 
reasonable constant permanence under all our criteria, we show that the natural 
concept class of pattern languages [?] with erasing separates martingale learners 
from density learners (also from frequency learners and uniform density learners). 
A martingale learner succeeds on the erasing pattern languages already at the 
small constant permanence 7. 

Fibonacci and other sequences defined by similar recurrence relations grow 
exponentially, yet we show such classes defined by bounded size of recurrence re- 
lations are learnable with reasonable constant permanence under all our criteria. 

While Sections 2 and 3 deal with drifts having no restrictions except for 
permanence bounds, Section 4 is devoted to some natural restrictions on drift 
like (a) the resulting function has to be computable, (b) the set N is 
computably partitioned into disjoint intervals I_0, I_1, ... such that each I_n has at least 
p(min(I_n)) elements and each f ∈ S[p] presented to the learner agrees on each 
interval I_n with some function g_n ∈ S and (c) the drift vacillates between a 
finite number of functions in S. In each case, it is shown that there are classes S 
and permanences p such that the class S[p]' consisting of all functions f ∈ S[p] 
satisfying an additional restriction on the drift can be learned with some smaller 
permanence or sharper learning criterion than the class S[p]. Hence, there are 
always situations where that restriction on the drift pays off, that is, where 
knowledge of some regularity within the drift allows construction of better learning 
algorithms. 

In the subsequent sections, logarithms are all base 2. Any computability 
terminology used below and not explained herein may be found in [?]. Due to 
space limitations, we only give a few sample proofs. 

2 Martingale, Frequency and Density Learners 

The first result states that everything that can be learned by a frequency 
learner can also be learned by a martingale learner. The strategy employed by the 
martingale learner is the well-known doubling algorithm, which sometimes ruins 
gamblers but which works nicely in this case. We omit the proof. 

Proposition 6. Suppose a class, S, of functions (possibly but not necessarily 
generated by some concept drift) is frequency learnable. Then S can be learned 
by a martingale. 

The next result investigates the inclusion relation on frequency learning for 
different parameters. We first introduce some definitions. In the following the natural 
numbers a, b, c, d always satisfy 1 ≤ a ≤ b and 1 ≤ c ≤ d. Let F_{a,b}(bx + y) = 
ax, for y = 0, 1, ..., b - a, and F_{a,b}(bx + y) = ax + y, for y = -a + 1, ..., 0. 
Note that, for every natural number d, it is possible to find x ∈ N, y ∈ Z 
with -a < y ≤ b - a, such that d = bx + y. For all a, b and d it holds that 
(a/b)·d - a < F_{a,b}(d) ≤ (a/b)·d. 
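
As a sanity check on the definition of F_{a,b}, the following sketch computes it and compares it with a brute-force count: the minimum number of members of the periodic set {xb + y : 0 ≤ y < a} in any window of d consecutive numbers. The function names `F` and `min_hits` are assumptions for illustration.

```python
def F(a, b, d):
    """F_{a,b}(d): write d = b*x + y with -a < y <= b - a, then
    return a*x if y >= 0 and a*x + y if y < 0."""
    x, y = divmod(d, b)          # here 0 <= y < b
    if y > b - a:                # move the remainder into the range (-a, 0)
        x, y = x + 1, y - b
    return a * x if y >= 0 else a * x + y


def min_hits(a, b, d):
    """Fewest members of {x*b + y : 0 <= y < a} in a window of length d."""
    return min(sum(1 for t in range(s, s + d) if t % b < a) for s in range(b))


for a, b, d in [(2, 5, 3), (2, 5, 4), (3, 4, 2)]:
    print(a, b, d, F(a, b, d), min_hits(a, b, d))
```

The agreement of the two columns reflects that the periodic set is hit exactly a times in every window of length b and attains the bound F_{a,b}(d) on a worst-placed window of length d.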

Theorem 7. Every class learnable with frequency a out of b is also learnable 
with frequency c out of d if c ≤ F_{a,b}(d). If c > F_{a,b}(d), then there exists a class 
of functions which is learnable with frequency a out of b, but not with frequency 
c out of d. 

Proof. For the first part, let c ≤ F_{a,b}(d) and suppose M predicts S with 
frequency a out of b. We claim that M also predicts S with frequency c out of d. 
Let f be an arbitrary function in S. Suppose d = bx + y, where -a < y ≤ b - a. 
The proof now proceeds by cases on the sign of y. 

(a): y ≥ 0. Then F_{a,b}(d) = ax. Since M predicts correctly at least a values of f on 
every interval of length b, M also predicts correctly at least ax values of f on every 
interval of length bx, which can be viewed as a union of x disjoint intervals of length b. 
Since d ≥ bx it follows that M predicts at least ax = F_{a,b}(d) values correctly on 
an interval of length d. 

(b): y < 0. Then F_{a,b}(d) = ax + y. For ease of notation let z = -y, so that 
d = bx - z and z is positive. Again one knows that M predicts correctly at least ax 
values of f on an interval of length bx. Of these predictions at most z can be correct on 
the last z arguments. So M makes at least F_{a,b}(d) = ax - z correct predictions 
on any interval of length d = bx - z. 

So in both cases M predicts f correctly on each interval of length d at 
least F_{a,b}(d) times, in particular at least c times. So M learns f with frequency 
c out of d. 
For the second part of the theorem, consider the class S of all primitive recursive 
functions which take the value 0 on the set X = {xb + y : x ∈ N, y ∈ {0, 1, ..., a - 1}}. 
S[p] is then the set of all functions (also the noncomputable ones) which are 0 
on the set X. For every learner M, there is a function f ∈ S[p] which differs 
from the predicted value on every z ∉ X. So starting with any input of the form 
z = xb + a - 1, M correctly predicts f(z + u), for u = 1, 2, ..., d, only if z + u is in 
X. Thus the number of correct predictions is at most 
|X ∩ {xb + a, xb + a + 1, ..., xb + a + d - 1}| = F_{a,b}(d). This completes the proof. □ 

Fact 8. For the notion of predicting with density and uniform density the 
following results hold. 

(a) If S[p] is learnable with uniform density q, then S[p] is also learnable with 
density q. On the other hand, there is a class S such that, for every permanence 
p, S[p] is learnable with density 1 but not with any uniform density q > 0. 

(b) If S[p] is learnable with frequency a out of b, then S[p] is also learnable with 
uniform density a/b. 

(c) Some S[p] is learnable with density 1/2 but not by any martingale. 

(d) If S[p] is learnable with density q > 1/2, then it is also learnable by a 
martingale. 

Proof. (a): The implication from learning with uniform density to learning 
with normal density follows directly from the definition. The separation follows 
ideas of Royer [?]. Consider the class S of all primitive recursive functions which 
are 0 on the set X = {x : (∃y) [2^y ≤ x < 2^(y+1) - y]}. Then S[p] contains all 
total functions which are 0 on X. So an algorithm which predicts 0 everywhere 
is correct on X. On the other hand, for any M, there is an f ∈ S[p] which 
differs from the predictions on every input outside X. So S[p] can be learned 
with density 1 (since X has density 1) but not with positive uniform density 
(since X has the uniform density 0). 

(b): Let M be any learner which predicts all f ∈ S[p] with frequency a out 
of b. Then, for any interval of length d, M predicts all f ∈ S[p] with frequency 
F_{a,b}(d) out of d. Since (a/b)·d - a < F_{a,b}(d) ≤ (a/b)·d, it follows that M learns 
S[p] with uniform density lim_{d→∞} (1/d) · F_{a,b}(d) = a/b. 

(c) : Let p{x) = 2x. Let S be the class of all primitive recursive {0, l}-valued 
functions g which satisfy — 21 og(a::) < g( 0 )-|-(;(l)-|-. . -+g{x) < ^^-|- 21 og(a;), 
for all a: > 1. There is a random function / which also satisfies this relation for 
all x > 1 [?]. This f is in S[p], for any permanence p, since every prefix of f 
is extended by some g ∈ S. On the other hand this sequence is not learnable 
by a martingale because of its randomness. The proof can now be completed by 
showing that just predicting 1 gives correctness density 1/2 or more; details are 
omitted due to space limitations. 

(d): Schnorr [14, Section 10] shows that every binary function not learnable 
by a martingale satisfies the law of large numbers, that is, the density of 1's 
converges to 1/2. Furthermore he showed that if the density of 1's is larger than 
1/2, then some martingale succeeds by always betting a suitable amount of money 
on 1. Similarly one can argue, for S[p] learnable by M with density q > 1/2, 
that some martingale succeeds on S[p], by betting always a suitable amount of 
money on the value predicted by M (since these predictions are correct on a set 
of density q > 1/2). □ 

The results of Fact 8 have some straightforward extensions: Learnability by 
martingales can also be obtained if S contains only functions f which are learnable 
via some fixed machine under some uniform density q_f > 0 or, equivalently, 
which are learnable under some frequency 1 out of b_f. This means that it is more 
important that all functions in the given class are learnable by the same learner 
than that they are learnable with respect to the same parameters. The other 
way, to fix the parameter but not the machine, does not help, since every 
computable function is predictable with frequency 1 out of 1 (by its own program), 
but the class of all computable functions is not learnable by a martingale [?]. 

3 Concrete Classes 

In this section optimal and nearly optimal bounds are derived for the permanence 
necessary and sufficient to learn certain concrete classes under drift. 

Suppose S is a class of up to k binary functions. We first investigate for which 
(constant depending on k) permanence p the class S[p] is frequency learnable. 

Looking at the class of all binary functions which repeat with period ⌊log(k)⌋ 
one sees directly that the condition p > log(k) is necessary; otherwise the class 
S[p] contains every binary function and is not learnable under any reasonable 
criterion. On the other hand, there is an upper bound that is only a bit above this 
lower bound. The problem which gives an upper bound slightly larger than the 
expected value ⌊log(k) + 1⌋ is that one does not explicitly know the intervals on 
which f coincides with some g from the concept class. So the learner intuitively 
has to assume that these intervals may be chosen by an adversary. The implicit 
bound on p in the next theorem could also be made a bit more explicit by 
taking stronger sufficient conditions such as p > log(k) + 2 log log(k + 1) + 10 or 
p > log(k) + log log(k + 1) + 2 log log log(k + 3) + 10. 

Theorem 9. Suppose S contains up to k computable {0,1}-valued functions 
and nothing else. Then S[p] is frequency learnable if p - log(p) > log(k). 

Proof. Fix k and corresponding p. Since all functions in S are computable and 
permanence is constant, it is possible to compute on any interval x + 1, x + 2, 
..., x + b, the finite set F_x of all possible value-vectors (f(x + 1), f(x + 2), 
..., f(x + b)), where f ranges over S[p]. Whenever there is a constant b such 
that |F_x| < 2^b, for all x, then one can predict one of the values in the given 
interval by the well-known halving algorithm. By restarting this process after 
any successful prediction one can show that S[p] is predictable with frequency 
1 out of b. So it remains to find such a b. 

Let b = 2p - 1. Any interval I' of length b contains a subinterval I of length 
p on which f equals some g ∈ S. Now f on I' can be described by the starting 
point of I, which is among the first p positions of the interval, the index of the 
function g ∈ S which coincides with f on I, and p - 1 binary bits to represent 
the remaining values of f. Thus, there are k · p · 2^(p-1) possibilities which f can 
take on the interval I'. Since log(k) < p - log(p), we have log(k) + log(p) < p, 
k · p < 2^p and k · p · 2^(p-1) < 2^(2p-1) = 2^b, which is the desired combinatorial 
condition. □ 
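
The halving step used in the proof can be sketched as follows: predict by majority vote over the candidate value-vectors that are still consistent, and discard the inconsistent ones after each value is revealed. Each wrong prediction at least halves the candidate set, so with fewer than 2^b candidates at least one of b predictions is correct. The code below is an illustrative sketch under that reading; `halving_predictions` and `check_all` are hypothetical names.

```python
from itertools import combinations, product


def halving_predictions(candidates, actual):
    """Predict each entry of 'actual' by majority vote over surviving candidates.

    candidates: binary value-vectors of the same length as 'actual';
    'actual' is assumed to be one of them. Ties are resolved in favour of 0.
    """
    alive = list(candidates)
    predictions = []
    for i, value in enumerate(actual):
        ones = sum(c[i] for c in alive)
        predictions.append(1 if 2 * ones > len(alive) else 0)
        alive = [c for c in alive if c[i] == value]  # keep consistent candidates
    return predictions


def check_all(b):
    """Exhaustively confirm 'at least one hit' for every candidate set of
    size 2**b - 1 over binary vectors of length b."""
    vectors = list(product([0, 1], repeat=b))
    for cands in combinations(vectors, 2 ** b - 1):
        for actual in cands:
            preds = halving_predictions(cands, actual)
            if not any(p == v for p, v in zip(preds, actual)):
                return False
    return True


print(check_all(3))
```

The exhaustive check is only feasible for very small b and is meant to make the counting argument tangible, not to replace it.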

As an application of this theorem, the class of all regular languages accepted by 
some deterministic finite automaton having at most h states, can be learned in 
the presence of concept drift with constant permanence (where, of course, the 
constant depends on h). We omit the details. 

Example 10. Suppose S is the class of the regular languages over the alphabet 
{0, 1} accepted by deterministic finite automata with up to h states. Then S[p] 
is frequency learnable if p - log(p) > 3h log(h + 1).^ For p < h, S[p] is not 
learnable under any reasonable learning criterion, since S[p] is the class of all 
the binary functions. We omit the proof. 



Finding the best permanence often requires considerable combinatorics. Some 
classes, such as polynomials, are easier to handle, and a full solution of the 
possible learning frequencies in dependence of the allowed degree and permanence is 
possible. The proof of Theorem 11 (c) furthermore gives the more general result 
that a class which contains a function extending every function with finite 
domain is not learnable under concept drift. The same principle holds if only the 
binary functions with finite domain are extended. Thus, one can obtain another 
proof for the second statement in the previous example. 

Theorem 11. Let k be a natural number and S be the class of all polynomials 
of degree up to k. 

(a) S[k + 1] contains every function and thus S cannot be learned with 
permanence k + 1 under any reasonable learning criterion. 

(b) If h > k + 1, then S[h] is learnable with frequency a out of b iff 
a ≤ F_{h-k-1,h}(b). 

(c) Let S be the class of all polynomials. Then, for every permanence p, the class 
S[p] contains all total functions and thus is not learnable under any reasonable 
criterion. 

Proof. (a): Let I_n = {n(k + 1), n(k + 1) + 1, ..., n(k + 1) + k}; the intervals I_0, 
I_1, ... form a partition of N and each interval contains exactly k + 1 elements. 
Given any function f, one can find for each n a polynomial g_n of degree up to k 
which is equal to f on I_n. Thus, 

(∀f) (∀n) (∃g_n ∈ S) (∀x ∈ I_n) [g_n(x) = f(x)] 

and S[k + 1] contains all the total functions. 

(b): For the positive result, with a ≤ F_{h-k-1,h}(b), it is sufficient to show that 
S can be frequency learned with frequency h - k - 1 out of h. The learner M 
predicts 0 for f(0), f(1), ..., f(k) and M predicts g_x(x + k + 1) for f(x + k + 1), 
where g_x is the polynomial of least degree which coincides with f on the arguments 
x, x + 1, ..., x + k. Let I = {y, y + 1, ..., y + h - 1} be an interval of length 
h and assume that y + u is the first place where the prediction algorithm makes 
an error. y + u must belong to some interval J of length h on which f coincides 
with some polynomial g of degree up to k. Since M errs, y + u must be among the 
first k + 1 elements of J. So M makes at least h - k - 1 correct predictions on 
the inputs y + u, y + u + 1, ..., y + u + h - 1. Since M makes in total at least 
u + h - k - 1 correct predictions on the interval {y, y + 1, ..., y + u + h - 1}, it 
follows that M makes at least h - k - 1 correct predictions on the interval 
{y, y + 1, ..., y + h - 1}. 

^ For example, p - log(p) > 3h log(h + 1) holds if p > (3h + 3) log(h + 3). 

Now consider the converse direction. Given any learner M, one can use the 
intervals I_n = {hn, hn + 1, ..., hn + h - 1} and find, for each n, a polynomial g_n 
of degree not above k, such that g_n(hn + u) = M(f(0)f(1)...f(hn + u - 1)) + 1, 
for u = 0, 1, ..., k. Let f = g_n on I_n. This inductive procedure gives a function f 
such that M fails to predict f(x) correctly whenever x is in {0, 1, ..., k} modulo 
h. It follows that, if M learns f with frequency a out of b, then a ≤ F_{h-k-1,h}(b) 
must hold. 

(c): This is similar to case (a). The growing permanence is compensated by 
the absence of any degree bound. Choosing a partition I_0, I_1, ... of N respecting 
the permanence, one can find, for each function f and each natural number n, 
a polynomial g_n which agrees with f on I_n. Thus, S[p] contains every total 
function. □ 

The values of polynomials can be computed from the preceding ones. So a linear 
function satisfies the equation f(x+2) = 2f(x+1) - f(x) and a quadratic function 
satisfies f(x+3) = 3f(x+2) - 3f(x+1) + f(x). The functions satisfying such 
equations are a natural generalization of polynomials. The Fibonacci numbers, 
given by f(x+2) = f(x) + f(x+1), and the powers of 2, given by f(x+1) = 2f(x), 
cannot be represented by polynomials and demonstrate that the generalization 
is proper. In the case of polynomials, it was necessary to bound the degree in 
order to achieve learnability. For the generalization, this bound is given by the 
number of terms on the right-hand side of the recurrence relation (1). We omit 
the details. 
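
The recurrences for linear and quadratic functions are instances of the vanishing (k+1)-th finite difference, which gives a direct way to run the learner from the proof of Theorem 11 (b). The sketch below is an illustration under that reading: `extrapolate` and `make_learner` are hypothetical names, and the drifting test function is invented for the example.

```python
from math import comb


def extrapolate(values):
    """Next value of the unique polynomial of degree <= k through k+1
    consecutive values v_0, ..., v_k, using
    f(x+k+1) = sum_i (-1)**(k-i) * C(k+1, i) * f(x+i)."""
    k = len(values) - 1
    return sum((-1) ** (k - i) * comb(k + 1, i) * v for i, v in enumerate(values))


def make_learner(k):
    """Learner for degree bound k: predict 0 until k+1 values are known,
    then extrapolate from the last k+1 values."""
    def learner(prefix):
        if len(prefix) < k + 1:
            return 0
        return extrapolate(prefix[-(k + 1):])
    return learner


# Drift between three linear functions (k = 1) with permanence h = 6.
f = [2 * x + 1 for x in range(6)] + [20 - x for x in range(6, 12)] + [7] * 6
M = make_learner(1)
errors = [t for t in range(len(f)) if M(f[:t]) != f[t]]
print(errors)  # only the first k+1 positions of each segment are mispredicted
```

In this run each change of concept costs exactly k + 1 = 2 errors, consistent with the frequency h - k - 1 out of h claimed for the learner.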

Example 12. Let S be the class of functions defined by a finite recurrence 
relation 

f(x + k + 1) = a_0 f(x) + a_1 f(x + 1) + ... + a_k f(x + k), (1) 

where the values f(0), f(1), ..., f(k) can be chosen arbitrarily. This class S is 
frequency learnable with permanence 2k + 3 but not with permanence k + 2. 

It is quite natural to ask whether the lower bound can be lifted to 2k + 2. 
The following example illustrates that a lower bound 2k + 2 would need some 
nontrivial properties of the space of the values which perhaps are present in the 
field Q and in the ring Z of the integers but which are certainly not present in 
the Boolean field {0, 1}. Again, we omit the details. 
Example 13. Let S be the class of functions defined by a finite recurrence 
relation 

g(x + k + 1) = a_0 g(x) + a_1 g(x + 1) + ... + a_k g(x + k), 

over the Boolean field {0, 1} where the multiplication is the “Boolean and”, the 
addition is the “Boolean exclusive or” and the values g(0), g(1), ..., g(k) can be 
chosen arbitrarily. This class S is frequency learnable with permanence 2k + 2. 

The pattern languages [?] are a prominent and natural language class. We 
consider a known natural extension with the aim of showing that some natural class 
S separates the ability to learn by a martingale from that to learn by a frequency 
learner. 

Let each Boolean string σ be identified with the number x such that 1σ is 
the binary (dual) code for x + 1; so 00 is identified with 3 and 111 is identified with 14. 
A pattern is a schema consisting of variables and constants. It generates the 
language of all words which can be obtained by replacing each variable by a 
fixed binary string. A pattern language [?] is called erasing if the variables in 
the defining pattern may be replaced by the empty string. So the pattern 0x1xy 
generates words like 01, 010, 011, 0010, 00100, 00101 and so on, but it does 
not generate the words 0000 and 11111 since the constants 0 and 1 cannot be 
removed. The proof is quite detailed and is omitted due to lack of space. 
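
Both conventions of this paragraph, the identification of strings with numbers and membership in an erasing pattern language, are easy to make concrete. The sketch below uses a regular expression with backreferences so that repeated variables receive the same substitution. All function names are assumptions, and variables are assumed to be single lowercase letters.

```python
import re


def string_to_number(s):
    """The number x such that '1' + s is the binary code of x + 1."""
    return int("1" + s, 2) - 1


def number_to_string(x):
    """Inverse of string_to_number: drop '0b' and the leading 1 of bin(x + 1)."""
    return bin(x + 1)[3:]


def pattern_to_regex(pattern):
    """Translate a pattern over constants 0, 1 and letter variables into a regex.
    The first occurrence of a variable becomes a named group that may match the
    empty string (erasing); later occurrences become backreferences."""
    seen, parts = set(), []
    for ch in pattern:
        if ch in "01":
            parts.append(ch)
        elif ch in seen:
            parts.append(f"(?P={ch})")
        else:
            seen.add(ch)
            parts.append(f"(?P<{ch}>[01]*)")
    return re.compile("".join(parts))


def generates(pattern, word):
    """True iff 'word' belongs to the erasing pattern language of 'pattern'."""
    return pattern_to_regex(pattern).fullmatch(word) is not None


print(string_to_number("00"), string_to_number("111"))
print([w for w in ["01", "0010", "00101", "0000", "11111"] if generates("0x1xy", w)])
```

Matching via backtracking regular expressions is adequate for short examples like these; it is not meant as an efficient membership test.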

Example 14. If S is the class of all erasing pattern languages then S[7] can be 
learned by a martingale but S[p] is not frequency learnable even for very fast 
growing permanences. For constant permanence it is also impossible to learn 
S[p] with some density q > 0. 

4 Restrictions on Drift 

The previous section dealt with arbitrary drift and therefore the learning 
algorithms intuitively had to compensate for drifts produced by an arbitrarily 
unpleasant adversary. One might argue that nature does not always follow the 
worst case but is sometimes more pleasant and well behaved. In particular, 
drifting concepts might follow some rules and laws; the next three subsections are 
devoted to discussing the influence of such rules on the ability to learn under 
concept drift. So we derive conditions under which the subclass S[p]' ⊆ S[p] of 
the functions resulting from restricted drift may be (and are) easier to learn. 
Due to space limitations, we omit proofs of most results in this section. 

4.1 Drifts Preserving Computability 

Let REC be the set of all computable functions. The present section investigates 
the case where the drift results in computable functions, that is, where S[p]' = 
S[p] ∩ REC. The results of Sections 2 and 3 carry over to the case where S[p]' 
is used instead of S[p], provided that in the places where something is “not 
learnable under any reasonable learning criterion”, this statement is weakened 
to “not learnable under any criterion which does not permit the learnability of 
all binary recursive functions.” That the inclusions in the previous results go 
through is quite obvious but the noninclusions require some additional work: 
instead of taking an arbitrary function for diagonalization one has to construct, 
for every computable learner, a specific computable function in S[p] on which 
this learner fails. 

Some criteria like learning with a fixed frequency a out of b either succeed on 
a function f or fail already on some finite prefix of f. So whenever such a learner 
fails on some f one can abstain from changing the concept after this failure. So, 
if a given learner fails on some f ∈ S[p], then it also fails on some computable f 
from the same class. Thus the question whether S[p] is learnable with frequency 
a out of b does not depend on the decision whether all or only the computable 
functions in S[p] have to be learned. 

For the other learning criteria it is decidable in the limit whether the learner is 
successful or not. So one can, for certain problems, compensate early errors by a 
lot of good predictions. For these criteria it can be an essential difference whether 
the learner has to cope with the whole class S[p] or only the subclass of all 
computable functions in S[p]. In particular the next theorem shows that there 
are classes where this transition allows a large improvement in learnability. 

Theorem 15. There is a class S such that, for any p, the class S[p] cannot 
be learned under any reasonable learning criterion, since, for each learner M, 
there is a function f ∈ S[p] which is never correctly predicted by M. However, 
the subclass S[p]' ⊆ S[p] consists only of functions with finite support and is 
therefore learnable with uniform density 1. 

4.2 Equality on Computable Intervals 

The second model limits the drift by requiring computable intervals on which 
the function to be predicted equals some concept in S. (We use S[p]' to denote 
the drift class formed in this fashion.) This allows, for example, a reduction in 
the upper bound from Theorem 9. 

Example 16. If S contains up to k finite functions and p > log(k), then the 
functions in S[p] respecting the computable intervals I_0, I_1, ... are frequency 
learnable by just using the majority vote algorithm on each interval I_n. The 
frequency is 1 out of 2⌊log(k)⌋ + 1. 

One might argue that such an improvement is due only to the ease of finding 
an algorithm and not to any real difference between the two concepts. The next 
example shows that there is a class S such that S[2]' is frequency learnable 
for computable intervals while the general class S[2] is not learnable under any 
reasonable criterion for arbitrary drift. 

Example 17. Let S be the class of all increasing binary functions, that is, S — 
: n G A/"}. Let Iq, /i, . . . be a computable partition of Af such that every 
interval I_n contains at least two elements. Let S[2]' be the class of all functions 
f ∈ S[2] which in addition coincide with some g_n ∈ S on every interval I_n. Now 
S[2]' is frequency learnable while S[2] itself is not learnable under any reasonable 
learning criterion since S[2] = 0 · {0, 1}^∞. 

In the above proof we used disjoint intervals of consistency. The next example 
shows that disjointness itself (and a little more) can yield advantages in terms 
of learnability with higher frequencies. 

Example 18. Let S contain all functions which are 0 at all but one argument. 
Then the subclass S[2]' of all functions f ∈ S[2] which coincide with functions 
in S on disjoint intervals of length at least 2 is learnable with frequency 2 out 
of 5. The whole class S[2] is not learnable with frequency 2 out of 5, though it 
is learnable with frequency 2 out of 6. The corresponding densities of the best 
possible learning algorithms are ^ and 

4.3 Vacillating Drift 

There are cases where, in principle, a drifting concept might involve any members 
of some infinite class but, in reality, the drift is only between finitely many of 
them. In this case, this knowledge can be exploited to achieve real improvements 
in learnability. 

As in the case of computable drift, vacillation cannot be exploited for fre- 
quency learners. However an improvement can be observed for other types of 
learning considered in this paper, that is for martingale learners, learners with 
some density and learners with uniform density. 

It should be noted that such an improvement is possible on many practical 
classes and not only on some artificially constructed examples as in the case 
of computable drift. These examples are the class of all polynomials for the 
case of constant permanence and any uniformly enumerable class for the case of 
nonconstant permanence. 

Example 19. Let p be constant and S[p]' denote the class of all functions which
vacillate between a finite number of polynomials with permanence p. Then S[p]'
can be learned with uniform density (p − 1)/p.

Proof. Let g_0, g_1, ... be a 1-1 enumeration of all the polynomials. Now the learner
M searches on input f(0)f(1)...f(x) for the first k such that g_k(x) = f(x). Then
M outputs g_k(x + 1) as a prediction for the next value:

M(f(0)f(1)...f(x)) = g_{min{k : g_k(x) = f(x)}}(x + 1).

For the verification of this algorithm, fix f ∈ S[p]'. There is an h such that
f vacillates only between the functions g_0, g_1, ..., g_h. Any two distinct
polynomials agree on only finitely many arguments. So there is a y such that all the
polynomials g_0, g_1, ..., g_h are different at each x > y.

Now let x > y and assume that the prediction for f(x + 1) fails. There is a
unique k ≤ h with f(x + 1) = g_k(x + 1). By assumption f(x) ≠ g_k(x) since the
prediction failed. Now f and g_k coincide at an interval of length p containing
x + 1 and not x. Thus the predictions for f at x + 2, x + 3, ..., x + p are correct.
Hence, each wrong prediction is followed by at least p − 1 correct ones. It follows
that M, on an interval of length x, makes at most y + x/p mistakes. Thus M learns
this f and also all other functions in S[p]' with uniform density (p − 1)/p. ∎
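
The prediction rule of this proof can be tried out on a small scale. The following is only a minimal sketch under stated assumptions: the list POLYS is a hypothetical finite stand-in for the enumeration g_0, g_1, ... of all polynomials, and the helper names (predict_next, vacillating, count_mistakes) are introduced here for illustration and do not come from the paper.

```python
from typing import Callable, List, Optional, Sequence

Poly = Callable[[int], int]


def predict_next(polys: Sequence[Poly], x: int, fx: int) -> Optional[int]:
    """Return g_k(x + 1) for the first k with g_k(x) == fx, as the learner M does."""
    for g in polys:
        if g(x) == fx:
            return g(x + 1)
    return None  # not reached when polys really enumerates every polynomial


def vacillating(polys: Sequence[Poly], order: Sequence[int], p: int) -> List[int]:
    """Values of a function that follows polys[k] on consecutive blocks of length p."""
    return [polys[k](b * p + j) for b, k in enumerate(order) for j in range(p)]


def count_mistakes(polys: Sequence[Poly], values: Sequence[int]) -> int:
    """Number of arguments x for which the prediction of f(x + 1) is wrong."""
    return sum(
        predict_next(polys, x, values[x]) != values[x + 1]
        for x in range(len(values) - 1)
    )


# Assumption: three polynomials only, used in blocks of length p = 3.
POLYS: List[Poly] = [lambda x: 0, lambda x: x, lambda x: x * x]
VALUES = vacillating(POLYS, order=[0, 1, 2, 0], p=3)
MISTAKES = count_mistakes(POLYS, VALUES)
```

In this run the learner errs only when the function switches to another polynomial, and each error is followed by at least p − 1 = 2 correct predictions, in line with the counting argument above.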

The above algorithm works for the special case of polynomials and has no
direct general equivalent. For example, if S is the class of all periodic functions,
then no learner achieves some minimum density on all functions in S[p]' for
constant permanence p. Here a function f is periodic if there is a y such that
f(x + y) = f(x) for all x. However, in the case of unbounded permanence, that
is, in the case that p is non-decreasing and not bounded by any constant, it is
possible to learn the class S[p]' with uniform density 1.

Theorem 20. Let S = g_0, g_1, ... be an effectively enumerable class of total
functions and let p be a computable non-decreasing and unbounded permanence.
Then the class of all functions in S[p] which vacillate between finitely many
functions in S can be learned with uniform density 1.

5 Some Concluding Remarks 

Finally, we would like to note that there is a connection between our model
and the mistake-bound learning model of Littlestone [10]. Consider the setting
in which a machine M predicts the values of a function f on a sequence of
arguments x_0, x_1, x_2, ... as follows: M is given x_0, M predicts the value of f at
x_0, M is given f(x_0), M is given x_1, M predicts the value of f at x_1, M is given
f(x_1), M is given x_2, and so on. We say that M learns a class S of functions with
mistake-bound c if M predicts, for each sequence x_0, x_1, x_2, ... and each function
f ∈ S, the function i ↦ f(x_i) at all but at most c places correctly. Since this
literal version of the mistake-bound model is quite restrictive, we make the model
somewhat more interesting by additionally requiring the sequence x_0, x_1, x_2, ...
to be increasing. An example class of functions learnable with mistake-bound c
is S_c = {decreasing f : f(0) < c}.
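
One simple predictor for a class like S_c can be sketched as follows. This is an illustration only and rests on assumptions that the text does not state: "decreasing" is read as non-increasing, the strategy "repeat the last value seen, starting with c − 1" is our own choice rather than the authors' algorithm, and the names count_mistakes and example_f are hypothetical.

```python
from typing import Callable, Iterable


def count_mistakes(f: Callable[[int], int], xs: Iterable[int], c: int) -> int:
    """Predict f on the increasing arguments xs and count wrong predictions."""
    guess = c - 1  # assumption: f(0) < c, so no value of f exceeds c - 1
    mistakes = 0
    for x in xs:
        value = f(x)          # revealed only after the prediction is made
        if guess != value:
            mistakes += 1     # a mistake means the value dropped below the guess
        guess = value         # from now on repeat the last value seen
    return mistakes


def example_f(x: int) -> int:
    """A non-increasing function with example_f(0) = 4 < 5."""
    return max(0, 4 - x // 2)


MISTAKES = count_mistakes(example_f, [0, 1, 3, 4, 10, 11], c=5)
```

Every mistake of this predictor strictly lowers the guess, so on a non-increasing function with f(0) < c it cannot err more than c times on any increasing sequence of arguments.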

Now the following can be shown to hold: If a class S is learnable with a
mistake-bound of c and if b < p for some constant permanence p, then S[p] is
learnable with frequency a out of b where a = b − 2c − 1. Furthermore, the class
S_c[p] cannot be learned with frequency a + 1 out of b, so the bound is tight.

Abstract. In their pioneering work, Mukouchi and Arikawa modeled a
learning situation in which the learner is expected to refute texts which 
are not representative of L, the class of languages being identified. Lange 
and Watson extended this model to consider justified refutation in which
the learner is expected to refute texts only if they contain a finite sample
unrepresentative of the class L. Both the above studies were in the con- 
text of indexed families of recursive languages. We extend this study 
in two directions. Firstly, we consider general classes of recursively enu- 
merable languages. Secondly, we allow the machine to either identify 
or refute the unrepresentative texts (respectively, texts containing finite
unrepresentative samples). We observe some surprising differences bet- 
ween our results and the results obtained for learning indexed families 
by Lange and Watson. 



1 Introduction 

Consider the identification of formal languages from positive data. A text for
a language is a sequential presentation (in arbitrary order) of all and only the
elements of the language. In a widely studied identification paradigm, called
TxtEx-identification, a learning machine is fed texts for languages, and, as
the machine is receiving the data, it outputs a (possibly infinite) sequence of
hypotheses. A learning machine is said to TxtEx-identify a language L just in
case, when presented with a text for L, the sequence of hypotheses output by the
machine converges to a grammar for L (formal definitions of criteria of inference
informally presented in this section are given in Sections 2 and 3). A learning
machine TxtEx-identifies a class, ℒ, of languages if it TxtEx-identifies each
language in ℒ. This model of identification was introduced by Gold [Gol67] and
has since then been explored by several researchers.

For the following, let ℒ denote a class of languages which we want to identify.
The model of identification presented above puts no constraint on the behaviour
of the machine on texts for languages not in ℒ. However, we may want a machine
to be able to detect that it cannot identify an input text for at least two reasons.
Firstly, once a machine detects that it cannot identify an input text, we can
use the machine for other useful purposes. Secondly, we may employ another
machine to identify the input text, so as to further enhance the class of languages
that can be identified. These are very useful considerations in the design of
a practical learning system. Further, it is philosophically interesting to study
machines which know their limitations.

In their pioneering work, Mukouchi and Arikawa [MA93] modeled such a
scenario. They required that in addition to identifying all languages in ℒ, the
machine should refute texts for languages not in ℒ (i.e. texts which are
“unrepresentative” of ℒ). We refer to this identification criterion as TxtRef.
Mukouchi and Arikawa showed that TxtRef imposes a serious restriction on the
learning capabilities of machines. For example, a machine working as above cannot
identify any infinite language.¹ This led Lange and Watson [LW94] (see also [...])
to consider justified refutation in which they require a machine to refute a text
iff some initial segment of the text is enough to determine that the input text is
not for a language in ℒ, i.e., the input text contains a finite sample
“unrepresentative” of ℒ. We call this criterion of learning TxtJRef. Lange
and Watson also considered a modification of the justified refutation model (called
TxtJIRef, for immediate justified refutation) in which the machine is required
to refute the input text as soon as the initial segment contains an unrepresentative
sample (formal definitions are given in Section 3). For further motivation
regarding learning with refutation and its relationship with Popper’s Logic for
scientific inference, we refer the reader to [...] and [...]. Jantke [...]
and Grieser [Gri96] have studied criteria similar to those studied in this paper
for function learning. Ben-David studied a refutation model for PAC learning in
[...].

[MA93] and [LW94] were mainly concerned with learning indexed families
of recursive languages, where the hypothesis space is also an indexed family. In
this paper, we extend the study in two directions. Firstly, we consider general
classes of r.e. languages, and use the class of all computer programs (modeling
accepting grammars) as the hypothesis space. Secondly, we allow a learning
machine to either identify or refute unrepresentative texts (texts containing finite
unrepresentative samples). Note that in the models of learning with refutation
considered by [MA93] and [LW94] described above, the machine has to refute all
texts which contain samples unrepresentative of ℒ. Thus, a machine which may
identify some of these texts is disqualified!² For learning general classes of r.e.
languages we feel that it is more reasonable to allow a machine to either identify
or refute such texts (in most applications identifying an unrepresentative text
is not going to be a disadvantage). This motivation has led us to the models
described in the present paper. We refer to these criteria by attaching an E (for
extended) in front of the corresponding criteria considered by [MA93, LW94].



¹ A machine working as above cannot refute a text for any subset of a language it
identifies; this, along with a result due to Gold [Gol67] (which says that no machine
can TxtEx-identify an infinite language and all of its finite subsets), shows that no
machine can TxtRef-identify a class containing an infinite language.

² This property and the restriction to indexed families are crucially used in proving
some of the results in [...].

We now highlight some important differences in the structure of results
obtained by us, and those in [LW94]. In the context of learning indexed families of
recursive languages, Lange and Watson (in their model, see also [...]) showed
that TxtJIRef = TxtJRef (i.e. requiring machines to refute as soon as
the initial segment becomes unrepresentative of ℒ is not a restriction). A similar
result was also shown by them for learning from informants.³ We show that
requiring immediate refutation is a restriction if we consider general classes of r.e.
languages (in both our (extended) and Lange and Watson’s models of justified
refutation, and for learning from texts as well as informants). We also consider
a variation of our model in which “unrepresentative” is with respect to what
a machine identifies and not with respect to the class ℒ. In this variation, for
learning from texts, the (immediate) justified refutation model has the same power
as TxtEx, a surprising result in the context of results in [LW94] and other
results in this paper. However, in the context of learning from informants, even
this variation fails to capture the power of InfEx (which is a criterion of learning
from informants; see Section 2).

We now proceed formally. 

2 Preliminaries 

The recursion theoretic notions not explained below are from [Rog67]. N =
{0, 1, 2, ...} is the set of all natural numbers, and this paper considers r.e. subsets
L of N. All conventions regarding range of variables apply, with or without
decoration,⁴ unless otherwise specified. We let c, e, i, j, k, l, m, n, p, s, t, u, v,
w, x, y, z range over N. Symbols ∅, ∈, ⊆, ⊇, ⊂, ⊃ denote empty set, member
of, subset, superset, proper subset, and proper superset respectively. Notation
max(), min(), and card() denote the maximum, minimum, and cardinality of a
set respectively, where by convention max(∅) = 0 and min(∅) = ∞. ⟨·, ·⟩ stands
for an arbitrary, one to one, computable encoding of all pairs of natural numbers
onto N. Quantifiers ∀^∞, ∃^∞, and ∃! denote for all but finitely many, there exist
infinitely many, and there exists a unique respectively.
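
The paper leaves the pairing ⟨·, ·⟩ arbitrary. As a concrete instance, the following sketch uses the Cantor pairing function; this particular choice, and the names pair and unpair, are assumptions made here for illustration only.

```python
import math
from typing import Tuple


def pair(x: int, y: int) -> int:
    """Cantor pairing: a computable bijection from N x N onto N."""
    return (x + y) * (x + y + 1) // 2 + y


def unpair(z: int) -> Tuple[int, int]:
    """Inverse of pair: recover (x, y) from the code z."""
    w = (math.isqrt(8 * z + 1) - 1) // 2  # w = x + y, the index of the diagonal
    y = z - w * (w + 1) // 2              # position of the pair on that diagonal
    return w - y, y
```

For a fixed first component the code grows with the second component, which is convenient when a construction needs elements of the form ⟨i, z⟩ above a given bound.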

ℛ denotes the set of total recursive functions from N to N. f and g range
over total recursive functions. ℰ denotes the set of all recursively enumerable
(r.e.) sets. L ranges over ℰ. L̄ denotes the complement of set L (i.e. L̄ = N − L).
χ_L denotes the characteristic function of set L. L_1 Δ L_2 denotes the symmetric
difference of L_1 and L_2, i.e., L_1 Δ L_2 = (L_1 − L_2) ∪ (L_2 − L_1). ℒ ranges over
subsets of ℰ. φ denotes a standard acceptable programming system (acceptable
numbering). φ_i denotes the function computed by the i-th program in
the programming system φ. We also call i a program or index for φ_i. For a
(partial) function η, domain(η) and range(η) respectively denote the domain
and range of partial function η. We often write η(x)↓ (η(x)↑) to denote that
η(x) is defined (undefined). W_i denotes the domain of φ_i. W_i is considered as
the language enumerated by the i-th program in the φ system, and we say that i
is a grammar or index for W_i. Φ denotes a standard Blum complexity measure
[...] for the programming system φ. W_{i,s} = {x < s | Φ_i(x) < s}.

³ An informant for a language L is a sequential presentation of the elements of the set
{(x, 1) | x ∈ L} ∪ {(x, 0) | x ∉ L}; see the formal definition in Section 2.

⁴ Decorations are subscripts, superscripts, primes and the like.

FIN denotes the class of finite languages, {L | card(L) < ∞}. INIT denotes
the class of initial segments of N, that is {{x | x < l} | l ∈ N}. L is called a
single valued total language iff (∀x)(∃!y)[⟨x, y⟩ ∈ L]. svt = {L | L is a single
valued total language}. If L ∈ svt, then we say that L represents the total
function f such that L = {⟨x, f(x)⟩ | x ∈ N}. K denotes the set {x | φ_x(x)↓}.
Note that K is r.e. but K̄ is not.

A text is a mapping from N to N ∪ {#}. We let T range over texts. content(T)
is defined to be the set of natural numbers in the range of T (i.e. content(T) =
range(T) − {#}). T is a text for L iff content(T) = L. That means a text for L
is an infinite sequence whose range, except for a possible #, is just L.

An infinite information sequence or informant is a mapping from N to (N ×
{0, 1}) ∪ {#}. We let I range over informants. content(I) is defined to be the
set of pairs in the range of I (i.e. content(I) = range(I) − {#}). By PosInfo(I)
we denote the set {x | (x, 1) ∈ content(I)}. By NegInfo(I) we denote the set
{x | (x, 0) ∈ content(I)}. For this paper, we only consider informants I such
that PosInfo(I) and NegInfo(I) partition the set of natural numbers.

An informant for L is an informant I such that PosInfo(I) = L. It is useful
to consider the canonical information sequence for L. I is a canonical information
sequence for L iff I(x) = (x, χ_L(x)). We sometimes abuse notation and refer to
the canonical information sequence for L by χ_L.

σ, τ, and γ range over finite initial segments of texts or informants, where the
context determines which is meant. We denote the set of finite initial segments of
texts by SEG and the set of finite initial segments of informants by SEQ. We define
content(σ) = range(σ) − {#} and, for σ ∈ SEQ, PosInfo(σ) = {x | (x, 1) ∈
content(σ)}, and NegInfo(σ) = {x | (x, 0) ∈ content(σ)}.
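
The notions just introduced for finite segments can be made concrete. The sketch below assumes that a segment is a Python list whose entries are either the string "#" or a pair (x, bit); the names content, pos_info, neg_info and canonical_prefix are illustrative and not taken from the paper.

```python
from typing import Callable, List, Set, Tuple, Union

PAUSE = "#"
Entry = Union[str, Tuple[int, int]]


def content(segment: List[Entry]) -> Set[Entry]:
    """range(segment) - {#}."""
    return {e for e in segment if e != PAUSE}


def pos_info(segment: List[Entry]) -> Set[int]:
    """All x such that (x, 1) occurs in the segment."""
    return {x for (x, bit) in content(segment) if bit == 1}


def neg_info(segment: List[Entry]) -> Set[int]:
    """All x such that (x, 0) occurs in the segment."""
    return {x for (x, bit) in content(segment) if bit == 0}


def canonical_prefix(chi: Callable[[int], int], n: int) -> List[Entry]:
    """First n entries of the canonical information sequence (x, chi(x))."""
    return [(x, chi(x)) for x in range(n)]


def is_even(x: int) -> int:
    """Characteristic function of the set of even numbers."""
    return 1 if x % 2 == 0 else 0


PREFIX = canonical_prefix(is_even, 6)
```

Here pos_info(PREFIX) and neg_info(PREFIX) split {0, ..., 5} into the even and the odd numbers, mirroring the requirement that PosInfo and NegInfo of an informant partition N.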

We use σ ⪯ T (respectively, σ ⪯ I, σ ⪯ τ) to denote that σ is an initial
segment of T (respectively, I, τ). |σ| denotes the length of σ. T[n] denotes the
initial segment of T of length n. Similarly, I[n] denotes the initial segment of
I of length n. σ ⋄ τ (respectively, σ ⋄ T, σ ⋄ I) denotes the concatenation of σ
and τ (respectively, concatenation of σ and T, concatenation of σ and I). We
sometimes abuse notation and say σ ⋄ w to denote the concatenation of σ with
the sequence of one element w.

A learning machine (also called inductive inference machine) M is an
algorithmic mapping from initial segments of texts (informants) to N ∪ {?}. We
say that M converges on T to i (written: M(T)↓ = i) iff, for all but finitely
many n, M(T[n]) = i. Convergence on informants is defined similarly.

We now present the basic models of identification from texts and informants. 

Definition 1. [Gol67, ...]

(a) M TxtEx-identifies text T iff (∃i | W_i = content(T))[M(T)↓ = i].

(b) M TxtEx-identifies L (written: L ∈ TxtEx(M)) iff M TxtEx-identifies
each text T for L.

(c) M TxtEx-identifies ℒ iff M TxtEx-identifies each L ∈ ℒ.

(d) TxtEx = {ℒ | (∃M)[M TxtEx-identifies ℒ]}.

Definition 2. [Gol67]

(a) M TxtFin-identifies text T iff (∃i | W_i = content(T))(∃n)[(∀m < n)
[M(T[m]) = ?] ∧ (∀m ≥ n)[M(T[m]) = i]].

(b) M TxtFin-identifies L (written: L ∈ TxtFin(M)) iff M TxtFin-identifies
each text T for L.

(c) M TxtFin-identifies ℒ iff M TxtFin-identifies each L ∈ ℒ.

(d) TxtFin = {ℒ | (∃M)[M TxtFin-identifies ℒ]}.

Intuitively, for finite identification, M outputs just one grammar, which must 
be correct. 
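
To contrast identification in the limit with finite identification, here is a small sketch with two toy learners. It is an illustration under explicit assumptions: a hypothesis is represented by the finite set it describes (a stand-in for a grammar index), text segments are Python lists over numbers and "#", and both function names are hypothetical.

```python
from typing import FrozenSet, List, Union

PAUSE = "#"      # the pause symbol of a text
NO_GUESS = "?"   # output of a machine that has not yet conjectured anything

Hypothesis = Union[str, FrozenSet[int]]


def ex_learner_for_fin(segment: List[Union[int, str]]) -> Hypothesis:
    """TxtEx-style learner for FIN: conjecture exactly the numbers seen so far."""
    return frozenset(e for e in segment if e != PAUSE)


def fin_learner_for_singletons(segment: List[Union[int, str]]) -> Hypothesis:
    """TxtFin-style learner for {{i} | i in N}: stay at ? until a number appears."""
    for e in segment:
        if e != PAUSE:
            return frozenset({e})
    return NO_GUESS


TEXT_PREFIX = [PAUSE, 5, PAUSE, 2, 5, 5]
EX_OUTPUTS = [ex_learner_for_fin(TEXT_PREFIX[:n]) for n in range(len(TEXT_PREFIX) + 1)]
FIN_OUTPUTS = [fin_learner_for_singletons([PAUSE, PAUSE, 7, 7][:n]) for n in range(5)]
```

The first learner may revise its conjecture several times before stabilising, whereas the second outputs ? and then a single conjecture that it never changes, which is the behaviour finite identification demands.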

Definition 3. [Gol67, ...]

(a) M InfEx-identifies informant I iff (∃i | W_i = PosInfo(I))[M(I)↓ = i].

(b) M InfEx-identifies L (written: L ∈ InfEx(M)) iff M InfEx-identifies
each informant I for L.

(c) M InfEx-identifies ℒ iff M InfEx-identifies each L ∈ ℒ.

(d) InfEx = {ℒ | (∃M)[M InfEx-identifies ℒ]}.

Definition 4. [Gol67]

(a) M InfFin-identifies informant I iff (∃i | W_i = PosInfo(I))(∃n)[(∀m < n)
[M(I[m]) = ?] ∧ (∀m ≥ n)[M(I[m]) = i]].

(b) M InfFin-identifies L (written: L ∈ InfFin(M)) iff M InfFin-identifies
each informant I for L.

(c) M InfFin-identifies ℒ iff M InfFin-identifies each L ∈ ℒ.

(d) InfFin = {ℒ | (∃M)[M InfFin-identifies ℒ]}.

The next two definitions introduce reliable identification. A reliable machine
diverges on texts (informants) it does not identify. Though a reliable machine
does not refute a text (informant) it does not identify, it at least doesn’t give
false hope by converging to a wrong hypothesis. This was probably the first
constraint imposed on a machine’s behaviour on languages outside the class being
identified. We give two variations of reliable identification based on whether the
machine is expected to diverge on every text which is for a language not in ℒ,
or just on texts it does not identify.

For the rest of the paper, for criteria of inference, J, we will only define what
it means for a machine to J-identify a class of languages ℒ. The identification
class J is then implicitly defined as J = {ℒ | (∃M)[M J-identifies ℒ]}.

Definition 5. [Min76]

(a) M TxtRel-identifies ℒ iff
(a.1) M TxtEx-identifies ℒ and
(a.2) (∀T | content(T) ∉ ℒ)[M(T)↑].

(b) M InfRel-identifies ℒ iff
(b.1) M InfEx-identifies ℒ and
(b.2) (∀I | PosInfo(I) ∉ ℒ)[M(I)↑].

(c) M ETxtRel-identifies ℒ iff
(c.1) M TxtEx-identifies ℒ and






S. Jain 



(c.2) (∀T | M does not TxtEx-identify T)[M(T)↑].
(d) M EInfRel-identifies ℒ iff
(d.1) M InfEx-identifies ℒ and
(d.2) (∀I | M does not InfEx-identify I)[M(I)↑].

The following propositions are some known facts about the identification
criteria discussed above, which we will be using in this paper. The first two
propositions are based on results due to Gold [Gol67].

Proposition 6. Suppose L is any infinite r.e. language, and M a learning
machine. Let σ be such that content(σ) ⊆ L. Then there exists an r.e. L',
content(σ) ⊆ L' ⊆ L, such that M does not TxtEx-identify L'.



Proposition 7. Suppose L is any infinite r.e. language, and M a learning
machine. Let σ be such that PosInfo(σ) ⊆ L. Then there exists an r.e. L',
PosInfo(σ) ⊆ L' ⊆ L, such that M does not InfEx-identify L'.



Proposition 8. [...] TxtFin ⊂ InfFin ⊂ TxtEx ⊂ InfEx.



3 Learning with Refutation 

In this section we introduce the refutation models for learning. For learning
with refutation we allow learning machines to output a special refutation symbol
denoted ⊥. We assume that if M(σ) = ⊥, then, for all τ, M(σ ⋄ τ) = ⊥. Intuitively,
output of ⊥ denotes that M is declaring the input to be “unrepresentative”.
In the following definitions we consider the different criteria mentioned in the
introduction. It is useful to define Cons_ℒ = {σ | (∃L ∈ ℒ)[content(σ) ⊆ L]}.
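
For a class that is given explicitly as a finite list of finite languages, membership of a segment in the set of consistent segments can be checked directly. The following sketch makes exactly that simplifying assumption (the paper itself deals with arbitrary classes of r.e. languages, for which no such direct check is claimed here); the names content and in_cons are illustrative.

```python
from typing import Iterable, List, Set, Union

PAUSE = "#"


def content(segment: List[Union[int, str]]) -> Set[int]:
    """The numbers occurring in a finite text segment."""
    return {e for e in segment if e != PAUSE}


def in_cons(segment: List[Union[int, str]], language_class: Iterable[Set[int]]) -> bool:
    """True iff some language of the class contains content(segment)."""
    seen = content(segment)
    return any(seen <= language for language in language_class)


# Assumption: a toy class with two finite languages.
CLASS = [{1, 2}, {3}]
```

With this toy class, the segment 1 # 2 is consistent, while the segment 1 3 is not, because no single language of the class contains both 1 and 3.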

The following definition introduces learning with refutation for general classes
of r.e. languages.

Definition 9. [MA93] M TxtRef-identifies ℒ iff

(a) M TxtEx-identifies ℒ and

(b) (∀T | content(T) ∉ ℒ)[M(T)↓ = ⊥].

If M(T)↓ = ⊥, then we often say that M refutes the text T. The following
definitions introduce identification with justified refutation for general classes of
r.e. languages. Below JRef stands for justified refutation, and JIRef stands for
justified immediate refutation.

Definition 10. [LW94] M TxtJRef-identifies ℒ iff

(a) M TxtEx-identifies ℒ and

(b) (∀T | content(T) ∉ ℒ and (∃σ ⪯ T)[σ ∉ Cons_ℒ])[M(T)↓ = ⊥].

Intuitively, in the above definition, M is required to refute a text T only if
T contains a finite sample which is unrepresentative of ℒ. The following definition
additionally requires that M refutes an initial segment of T as soon as it contains
an unrepresentative sample.





Learning with Refutation



Definition 11. [LW94] M TxtJIRef-identifies ℒ iff

(a) M TxtEx-identifies ℒ and

(b) (∀T | content(T) ∉ ℒ)(∀σ ⪯ T | σ ∉ Cons_ℒ)[M(σ) = ⊥].

We now present the above criteria for learning from informants. It is useful
to define the following analogue of Cons: ICons_ℒ = {σ | (∃L ∈ ℒ)[PosInfo(σ) ⊆
L ∧ NegInfo(σ) ⊆ L̄]}.

Definition 12.

(a) [MA93] M InfRef-identifies ℒ iff
(a.1) M InfEx-identifies ℒ and
(a.2) (∀I | PosInfo(I) ∉ ℒ)[M(I)↓ = ⊥].

(b) [LW94] M InfJRef-identifies ℒ iff
(b.1) M InfEx-identifies ℒ and
(b.2) (∀I | PosInfo(I) ∉ ℒ and (∃σ ⪯ I)[σ ∉ ICons_ℒ])[M(I)↓ = ⊥].

(c) [LW94] M InfJIRef-identifies ℒ iff
(c.1) M InfEx-identifies ℒ and
(c.2) (∀I | PosInfo(I) ∉ ℒ)(∀σ ⪯ I | σ ∉ ICons_ℒ)[M(σ) = ⊥].

We now present our extended definition for learning with refutation.
Intuitively, we extend the above definitions of [MA93, LW94] by allowing a machine to
identify an unrepresentative text.

The following definition is a modification of the corresponding definition in
[MA93]. E in the beginning of criteria of inference, such as ETxtRef, stands
for extended.

Definition 13. M ETxtRef-identifies ℒ iff

(a) M TxtEx-identifies ℒ and

(b) (∀T | M does not TxtEx-identify T)[M(T)↓ = ⊥].

Intuitively, in the above definition we require the machine to refute the input
text only if it does not TxtEx-identify it.

The following definitions on identification by justified refutation are
modifications of corresponding definitions considered by [LW94].

Definition 14. M ETxtJRef-identifies ℒ iff

(a) M TxtEx-identifies ℒ and

(b) (∀T | M does not TxtEx-identify T and (∃σ ⪯ T)[σ ∉ Cons_ℒ])
[M(T)↓ = ⊥].

Intuitively, in the above definition, M is required to refute a text T only if M
does not identify T, and T contains a finite sample which is unrepresentative of
ℒ. In the following definition, we additionally require that M refute an initial
segment of T as soon as it contains an unrepresentative sample.

Definition 15. M ETxtJIRef-identifies ℒ iff

(a) M TxtEx-identifies ℒ and

(b) (∀T | M does not TxtEx-identify T)(∀σ ⪯ T | σ ∉ Cons_ℒ)[M(σ) = ⊥].

We now present the above criteria for learning from informants.

Definition 16.

(a) M EInfRef-identifies ℒ iff
(a.1) M InfEx-identifies ℒ and
(a.2) (∀I | M does not InfEx-identify I)[M(I)↓ = ⊥].

(b) M EInfJRef-identifies ℒ iff
(b.1) M InfEx-identifies ℒ and
(b.2) (∀I | M does not InfEx-identify I and (∃σ ⪯ I)[σ ∉ ICons_ℒ])
[M(I)↓ = ⊥].

(c) M EInfJIRef-identifies ℒ iff
(c.1) M InfEx-identifies ℒ and
(c.2) (∀I | M does not InfEx-identify I)(∀σ ⪯ I | σ ∉ ICons_ℒ)[M(σ) = ⊥].

4 Results 

We next consider the relationship between different identification criteria defined 
in this paper. The results presented give a complete relationship between all the 
criteria of inference introduced in this paper. 

4.1 Containment Results 

The following containments follow immediately from the definitions. 

Proposition 17. TxtRef ⊆ TxtRel ⊆ TxtEx.
TxtRef ⊆ TxtJRef ⊆ TxtEx.
TxtJIRef ⊆ TxtJRef ⊆ TxtEx.
InfRef ⊆ InfRel ⊆ InfEx.
InfRef ⊆ InfJRef ⊆ InfEx.
InfJIRef ⊆ InfJRef ⊆ InfEx.
TxtRef ⊆ InfRef ⊆ InfEx.
TxtRel ⊆ InfRel ⊆ InfEx.

Proposition 18. ETxtRef ⊆ ETxtRel ⊆ TxtEx.
ETxtRef ⊆ ETxtJRef ⊆ TxtEx.
ETxtJIRef ⊆ ETxtJRef ⊆ TxtEx.
EInfRef ⊆ EInfRel ⊆ InfEx.
EInfRef ⊆ EInfJRef ⊆ InfEx.
EInfJIRef ⊆ EInfJRef ⊆ InfEx.
ETxtRef ⊆ EInfRef ⊆ InfEx.
ETxtRel ⊆ EInfRel ⊆ InfEx.

Proposition 19. (a) TxtJIRef = ETxtJIRef. 

(b) InfJIRef = EInfJIRef. 

Proof. (a) It suffices to show ETxtJIRef ⊆ TxtJIRef. Suppose M
ETxtJIRef-identifies ℒ.

Claim 20. For all σ such that σ ∉ Cons_ℒ, M(σ) = ⊥.

Proof. (of Claim) Suppose by way of contradiction that there is a σ such that
σ ∉ Cons_ℒ and M(σ) ≠ ⊥. Let T be an extension of σ such that M does not
TxtEx-identify T. Note that by Proposition 6 there exists such a T. But then
by definition of ETxtJIRef, M(σ) = ⊥. A contradiction. Thus the claim holds. □

It immediately follows from the claim that M also TxtJIRef-identifies ℒ.
Part (b) can be proved in a manner similar to part (a). ∎



Theorem 21. Suppose X is a set not in Σ_2. Let ℒ = {{i} | i ∈ X}. Suppose
J ∈ {TxtRef, TxtRel, TxtJRef, InfRef, InfRel, InfJRef}. Then ℒ ∈ EJ
but ℒ ∉ J.

Proof. It is easy to construct a machine which identifies all texts for empty or
singleton languages and refutes/diverges on all texts for languages containing at
least 2 elements. Thus, we have that ℒ ∈ EJ.

Now, suppose by way of contradiction that M J-identifies ℒ. Then, i ∈ X iff
(∃σ | content(σ) = {i})(∀τ | σ ⪯ τ ∧ content(τ) = {i})[M(σ) = M(τ) ∧ M(σ) ∈
N]. A contradiction to the fact that X is not in Σ_2. ∎
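
The machine mentioned at the start of this proof is easy to sketch for the refuting case. The code below is an illustration under assumptions: hypotheses are represented by the finite sets they describe rather than by grammar indices, the string "REFUTE" stands in for the refutation symbol, and the function name is hypothetical.

```python
from typing import FrozenSet, List, Union

PAUSE = "#"
REFUTE = "REFUTE"  # stand-in for the refutation symbol


def singleton_or_refute(segment: List[Union[int, str]]) -> Union[str, FrozenSet[int]]:
    """Conjecture the empty or singleton set seen so far; refute on two elements."""
    seen = frozenset(e for e in segment if e != PAUSE)
    if len(seen) >= 2:
        return REFUTE  # stays REFUTE on every extension, since content only grows
    return seen


SEGMENT = [4, PAUSE, 4, 9, 4]
OUTPUTS = [singleton_or_refute(SEGMENT[:n]) for n in range(len(SEGMENT) + 1)]
```

Because the content of a segment can only grow, a refutation issued by this machine is never withdrawn, as required of machines that output the refutation symbol. The machine does not depend on the set X at all.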

Theorem 22. Suppose J ∈ {TxtRef, TxtRel, TxtJRef, InfRef, InfRel,
InfJRef}. Then, J ⊂ EJ.



4.2 Separation Results 

We now proceed to show the separation results. The next two theorems show the 
advantages of finite identification over reliable identification and identification 
with refutation. For the proof of first theorem, we need the following proposition, 
which follows immediately from definitions. 

Proposition 23. Suppose ℒ is such that:

(a) ℒ ∈ EInfJRef, and

(b) For all L_1, L_2 ∈ ℒ, either L_1 ∩ L_2 = ∅ or L_1 = L_2.

Then ℒ ∈ EInfRef (and thus in EInfRel).

Let M_0, M_1, ... denote a recursive enumeration of all machines.

Theorem 24. TxtFin − (EInfRel ∪ EInfJRef) ≠ ∅.

Proof. For each i, we will define below a nonempty language L_i with the following
properties:

(a) L_i ⊆ {⟨i, n⟩ | n ∈ N};

(b) either M_i is not reliable, or L_i ∉ InfEx(M_i);

(c) a grammar for L_i can be obtained effectively in i.

We take ℒ = {L_i | i ∈ N}. Clearly, ℒ ∈ TxtFin (since a grammar for L_i
can be found effectively from i). Further, (using clause (b) above) we have that
ℒ ∉ EInfRel. It thus follows from Proposition 23 that ℒ ∉ EInfJRef.

We will define L_i in stages below. Let L_i^s denote L_i defined before stage s.
Let L_i^0 = {⟨i, 0⟩}. Let x^0 = ⟨i, 1⟩. Go to stage 0.

Stage s

1. Suppose I_1^s is the canonical information sequence for L_i^s and I_2^s is the
canonical information sequence for L_i^s ∪ {x^s}.

2. Search for n > x^s such that either M_i(I_1^s[n]) ≠ M_i(I_1^s[x^s]) or
M_i(I_2^s[n]) ≠ M_i(I_2^s[x^s]).
If and when such an n is found proceed to step 3.

3. Let n be as found in step 2. If M_i(I_2^s[n]) ≠ M_i(I_2^s[x^s]), then let
L_i^{s+1} = L_i^s ∪ {x^s}; otherwise let L_i^{s+1} = L_i^s.
Let x^{s+1} ∈ {⟨i, z⟩ | z ∈ N} be such that x^{s+1} > n.
Go to stage s + 1.

End stage s

It is easy to verify that L_i can be enumerated effectively in i. Fix i. We
consider two cases in the definition of L_i.

Case 1: There exist infinitely many stages.

In this case M_i on the canonical informant for L_i makes infinitely many mind
changes.

Case 2: Stage s starts but does not end.

In this case M_i converges to the same grammar for both L_i and L_i ∪ {x^s}
(which are distinct languages). Thus M_i is not reliable.

The above cases show that ℒ is not EInfRel-identified by M_i. ∎
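
The staging construction can be simulated for a bounded number of steps. The sketch below is only an approximation under explicit assumptions: the machine is an arbitrary Python function on informant prefixes (a stand-in for M_i), the pairing is the Cantor pairing, and the unbounded search of step 2 is cut off at a parameter bound, so a result of False only means that no suitable n was found below the cutoff. All names are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Informant = List[Tuple[int, int]]
Machine = Callable[[Informant], object]


def pair(x: int, y: int) -> int:
    """Cantor pairing, standing in for the coding of pairs."""
    return (x + y) * (x + y + 1) // 2 + y


def prefix(lang: Set[int], n: int) -> Informant:
    """First n entries of the canonical information sequence for lang."""
    return [(v, int(v in lang)) for v in range(n)]


def build(machine: Machine, i: int, max_stages: int, bound: int) -> Tuple[Set[int], bool]:
    """Run up to max_stages stages; report whether every started stage ended."""
    lang = {pair(i, 0)}
    z = 1
    x = pair(i, z)
    for _ in range(max_stages):
        with_x = lang | {x}
        base1 = machine(prefix(lang, x))
        base2 = machine(prefix(with_x, x))
        n = x + 1
        # Step 2: look for a mind change on either informant (cut off at bound).
        while (n < bound and machine(prefix(lang, n)) == base1
               and machine(prefix(with_x, n)) == base2):
            n += 1
        if n >= bound:
            return lang, False
        # Step 3: keep x if the change happened on the informant containing x.
        if machine(prefix(with_x, n)) != base2:
            lang = with_x
        while pair(i, z) <= n:
            z += 1
        x = pair(i, z)  # next candidate of the form <i, z>, larger than n
    return lang, True


CHANGING = build(len, 0, 3, 100)            # a machine that always changes its mind
CONSTANT = build(lambda seg: 0, 0, 3, 50)   # a machine that never changes its mind
```

The two sample machines correspond to the two cases of the proof: the first keeps changing its conjecture, so stages keep ending; the second never changes, so the first stage does not end within the cutoff.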

Theorem 25. TxtFin − ETxtJRef ≠ ∅.

The following theorem shows the advantages of identification with refutation 
over finite identification. 

Theorem 26. (TxtRef ∩ TxtJIRef ∩ InfJIRef) − InfFin ≠ ∅.

Proof. Let ℒ = {L | card(L) < 2}. It is easy to verify that ℒ witnesses the
separation. ∎

The following theorem shows the advantages of justified refutation and re- 
liable identification over the case when the learning machine has to refute all 
unidentified texts (informants). 

Theorem 27. (TxtRel ∩ TxtJIRef ∩ InfJIRef) − InfRef ≠ ∅.

Proof. It is easy to verify that FIN witnesses the separation. ∎

The following theorem shows the disadvantages of immediate refutation. Note 
that for learning indexed families of recursive languages, Lange and Watson have 
shown that TxtJRef = TxtJIRef and InfJRef = InfJIRef, and thus the 
following result does not hold for learning indexed families of recursive languages. 

Theorem 28. (a) TxtRef − EInfJIRef ≠ ∅.

(b) TxtRef − ETxtJIRef ≠ ∅.

Proof. Let ℒ = {{i} | i ∉ K}. It is easy to verify that ℒ ∈ TxtRef. We
show that ℒ ∉ ETxtJIRef. A similar proof also shows that ℒ ∉ EInfJIRef.
Suppose by way of contradiction that M ETxtJIRef-identifies ℒ. Then the
following claim shows that K̄ is r.e., a contradiction. Thus ℒ ∉ ETxtJIRef.

Claim 29. i ∈ K̄ ⟺ (∃σ | content(σ) = {i})[M(σ)↓ ≠ ⊥].

Proof. Suppose i ∈ K̄. Then, since M TxtEx-identifies {i} ∈ ℒ, there must
exist a σ such that content(σ) = {i} and M(σ)↓ ≠ ⊥. On the other hand suppose
by way of contradiction that i ∉ K̄, and σ is such that content(σ) = {i} and
M(σ)↓ ≠ ⊥. Let T be an extension of σ such that M does not TxtEx-identify
T (there exists such a T by Proposition 6). But then, since σ ∉ Cons_ℒ, by
definition of ETxtJIRef, M(σ) must be equal to ⊥; a contradiction. This proves
the claim, and completes the proof of the theorem. ∎

In the context of learnability of indexed families, Lange and Watson (in 
their model of learning with refutation) had shown that immediate refutation is 
not a restriction, i.e. TxtJRef = TxtJIRef and InfJRef = InfJIRef. The 
following corollary shows that immediate refutation is a restriction in the context 
of learning general classes of r.e. languages! Note that this restriction holds for 
both extended and unextended models of justified refutation (for general classes 
of r.e. languages). 

Corollary 30. (a) InfJIRef ⊂ InfJRef.

(b) TxtJIRef ⊂ TxtJRef.

(c) EInfJIRef ⊂ EInfJRef.

(d) ETxtJIRef ⊂ ETxtJRef.

The following theorem shows the advantages of justified refutation over re- 
liable identification. 

Theorem 31. (TxtJIRef ∩ InfJIRef) − EInfRel ≠ ∅.

Proof. For f ∈ ℛ, let L_f = {⟨x, y⟩ | f(x) = y}. Let ℒ = {L_f | φ_{f(0)} = f} ∪
{L | L ∈ FIN ∧ (∃x, y, z | y ≠ z)[⟨x, y⟩ ∈ L ∧ ⟨x, z⟩ ∈ L]}. It is easy to verify
that ℒ ∈ TxtEx (and thus InfEx). Since ICons_ℒ = SEQ, it follows that
ℒ ∈ TxtJIRef ∩ InfJIRef. Essentially the proof in [CJNM94] to show that
{f | φ_{f(0)} = f} cannot be identified by a reliable machine (for function learning)
translates to show that ℒ ∉ EInfRel. ∎

The following theorem shows the advantages of reliable identification over 
justified refutation. 

Theorem 32. (a) TxtRel − ETxtJRef ≠ ∅.

(b) TxtRel − EInfJRef ≠ ∅.

The following theorem shows the advantages of having an informant over
texts.

Theorem 33. (InfRef ∩ InfJIRef) − TxtEx ≠ ∅.

Proof. INIT ∪ {N} witnesses the separation. ∎



5 A Variation of Extended Justified Refutation Criteria 

In the definitions for (extended) criteria of learning with justified refutation,
we required the machines to either identify or (immediately) refute any text
(informant) which contained a finite sample unrepresentative of the class,
ℒ, being learned. For example, in ETxtJRef-identification we required that
the machine either identify or refute every text which starts with an initial
segment not in Cons_ℒ. Alternatively, we could place such a restriction only
for texts which are not representative of what the machine identifies (note that
this gives more freedom to the machine). In other words, in the definitions for
ETxtJRef, ETxtJIRef, EInfJRef, EInfJIRef, we could have taken {T[n] |
n ∈ N and M TxtEx-identifies T} instead of Cons_ℒ, and {I[n] | n ∈ N and M
InfEx-identifies I} instead of ICons_ℒ. Let these new classes formed be called
ETxtJRef', EInfJRef', ETxtJIRef', EInfJIRef'. Note that a similar change
does not affect the classes ETxtRef, ETxtRel, EInfRef, EInfRel.

An interesting property of the classes ETxtJRef', EInfJRef', ETxtJIRef',
EInfJIRef', which is easy to show, is that they are closed under the subset
operation (i.e., if $\mathcal{L} \in \mathrm{ETxtJRef}'$, then every $\mathcal{L}' \subseteq \mathcal{L}$ is in
ETxtJRef'). Note that ETxtJIRef, ETxtJRef, EInfJIRef, EInfJRef are not closed
under the subset operation; this follows immediately from Theorem [?] and the
fact that FIN belongs to each of these inference criteria.

We now show that ETxtJIRef' and ETxtJRef' obtain the full
power of TxtEx! This is a surprising result given the results in [LW94] and this
paper (ETxtJRef and EInfJRef do not even contain TxtFin, as shown in
Theorem [?] and Theorem [?]).

Theorem 34. (a) TxtEx = ETxtJRef' = ETxtJIRef'. 

(b) EInfJRef' = EInfJIRef'. 

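Proof. For part (a), it is enough to show that $\mathrm{TxtEx} \subseteq \mathrm{ETxtJIRef}'$.
Consider any class $\mathcal{L} \in \mathrm{TxtEx}$. If $N \in \mathcal{L}$, then it immediately follows that
$\mathcal{L} \in \mathrm{ETxtJIRef}'$ (since $\mathrm{Cons}_{\mathcal{L}} = \mathrm{SEQ}$). If $N \notin \mathcal{L}$, then let
$\mathcal{L}' = \mathcal{L} \cup \mathrm{INIT}$. It was shown by Fulk [Ful90] that, if $\mathcal{L} \in \mathrm{TxtEx}$ and
$\mathcal{L}$ does not contain N, then $\mathcal{L}'$ as defined above is in TxtEx. Since $\mathcal{L}'$
contains a superset of every finite set, it follows that $\mathcal{L}' \in \mathrm{ETxtJIRef}'$
(since $\mathrm{Cons}_{\mathcal{L}'} = \mathrm{SEQ}$). Part (a) now follows using the fact that ETxtJIRef'
is closed under the subset operation.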

(b) It is sufficient to show that $\mathrm{EInfJRef}' \subseteq \mathrm{EInfJIRef}'$. Suppose M
EInfJRef'-identifies $\mathcal{L}$. We construct an M' which EInfJIRef'-identifies $\mathcal{L}$.
Let g denote a recursive function such that, for all finite sets S, $W_{g(S)} = S$.
On any input $I[n]$, M' behaves as follows. If $M(I[n]) \neq \bot$, then
$M'(I[n]) = M(I[n])$ (this ensures that M' InfEx-identifies $\mathcal{L}$). If
$M(I[n]) = \bot$, then let m be the smallest number such that $M(I[m]) = \bot$. If
$\mathrm{PosInfo}(I[n]) = \mathrm{PosInfo}(I[m])$, then $M'(I[n])$ outputs
$g(\mathrm{PosInfo}(I[n]))$. Otherwise $M'(I[n])$ outputs $\bot$. We claim that for
every $\sigma$, either $M'(\sigma) = \bot$ or there exists an extension I of $\sigma$ such that
M' InfEx-identifies I. So suppose $M'(\sigma) \neq \bot$. We consider the following cases.

Case 1: $M(\sigma) \neq \bot$.

If M InfEx-identifies some extension of $\sigma$, then clearly M' does too. So
suppose that M does not InfEx-identify any extension of $\sigma$. This implies that
M refutes every informant which begins with $\sigma$. Let I be an informant, extending
$\sigma$, for $\mathrm{PosInfo}(\sigma)$. Let n be the least number such that $M(I[n]) = \bot$. Note
that $I[n]$ must be an extension of $\sigma$. It now follows from the definition of M'
that M' InfEx-identifies I.

Case 2: $M(\sigma) = \bot$.

Let $\tau$ be the smallest prefix of $\sigma$ such that $M(\tau) = \bot$. It follows from the
definition of M' that $\mathrm{PosInfo}(\tau) = \mathrm{PosInfo}(\sigma)$. Let I, extending $\sigma$, be an
informant for $\mathrm{PosInfo}(\sigma)$. It follows from the definition of M' that $M'(I)$
is a grammar for $\mathrm{PosInfo}(\tau) = \mathrm{PosInfo}(I)$.

From the above cases, it follows that M' EInfJIRef'-identifies $\mathcal{L}$. □
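
The construction of M' from M in part (b) can be sketched in code. The following is only an illustration under simplifying assumptions that are ours, not the paper's: an informant segment is a Python list of (element, label) pairs, a refutation is modelled as `None`, and a "grammar" for a finite set is just a tagged frozenset. The names `make_m_prime`, `pos_info` and `g` are hypothetical.

```python
BOTTOM = None  # stands in for the refutation symbol


def pos_info(segment):
    """Positive information in an informant segment of (element, label) pairs."""
    return frozenset(x for x, label in segment if label)


def g(finite_set):
    """Toy 'grammar' for a finite set (assumption: a tagged frozenset)."""
    return ("finite", frozenset(finite_set))


def make_m_prime(M):
    """Wrap a machine M so that it never refutes later than at the first
    point where M itself refutes, following the case distinction in the proof."""
    def M_prime(segment):
        out = M(segment)
        if out is not BOTTOM:
            return out  # copy M while M has not refuted
        # smallest m such that M refutes the prefix of length m
        m = next(i for i in range(len(segment) + 1) if M(segment[:i]) is BOTTOM)
        if pos_info(segment) == pos_info(segment[:m]):
            return g(pos_info(segment))  # conjecture the finite set seen so far
        return BOTTOM
    return M_prime


# Toy machine: conjectures "h0" on short segments and refutes from length 2 on.
toy_M = lambda segment: "h0" if len(segment) < 2 else BOTTOM
toy_M_prime = make_m_prime(toy_M)
```

On the segment `[(0, True), (1, False), (2, False)]` the wrapped machine outputs the grammar for `{0}`, because no new positive data arrived after the first refutation point; once a new positive element appears, it refutes.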
However, unlike the case for texts, EInfJRef' is not equal to InfEx.

Theorem 35. $\mathrm{TxtEx} - \mathrm{EInfJRef}' \neq \emptyset$.

The following corollary can be obtained from the (omitted) proof of the above
theorem.

Corollary 36. $\mathrm{TxtJIRef} - \mathrm{EInfJRef}' \neq \emptyset$.

The following theorem, however, shows that EInfJRef' contains InfRel.

Theorem 37. $\mathrm{InfRel} \subseteq \mathrm{EInfJIRef}'$.

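Proof. Note that InfRel is closed under finite union. Also $\mathrm{FIN} \in \mathrm{InfRel}$. Now
suppose $\mathcal{L} \in \mathrm{InfRel}$. Thus $(\mathcal{L} \cup \mathrm{FIN}) \in \mathrm{InfRel} \subseteq \mathrm{InfEx}$. It follows that
$(\mathcal{L} \cup \mathrm{FIN}) \in \mathrm{EInfJIRef}'$ (since $\mathrm{ICons}_{\mathcal{L} \cup \mathrm{FIN}} = \mathrm{SEQ}$). Now, EInfJIRef' is
closed under the subset operation, and thus it follows that $\mathcal{L} \in \mathrm{EInfJIRef}'$. □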



6 Conclusions 

Mukouchi and Arikawa modeled a learning situation in which the learner is
expected to refute texts which are not representative of $\mathcal{L}$, the class of languages
being identified. Lange and Watson extended this model to consider justified
refutation, in which the learner is expected to refute a text only if it contains a
finite sample unrepresentative of the class $\mathcal{L}$. Both of the above studies were in the
context of indexed families of recursive languages. In this paper we extended
this study in two directions. Firstly, we considered general classes of recursively
enumerable languages. Secondly, we allowed the machine to either identify or
refute the unrepresentative texts (respectively, texts containing finite
unrepresentative samples). We observed some surprising differences between our results
and the results obtained for learning indexed families by Lange and Watson.
For example, in the context of learning indexed families of recursive languages,
Lange and Watson (in their model) showed that TxtJIRef = TxtJRef (i.e.,
requiring machines to refute as soon as the initial segment becomes
unrepresentative of $\mathcal{L}$ is not a restriction). A similar result was also shown by them for
learning from informants. We showed that requiring immediate refutation is a
restriction if we consider general classes of r.e. languages (in both our (extended)
and Lange and Watson's models of justified refutation, and for learning
from texts as well as informants). We also considered a variation of our model in
which "unrepresentative" is with respect to what a machine identifies and not
with respect to the class $\mathcal{L}$. In this variation, for learning from texts, the (immediate)
justified refutation model has the same power as TxtEx, a surprising result
in the context of the results in [LW94] and other results in this paper. However, in
the context of learning from informants, even this variation fails to capture the
power of InfEx.

It would be useful to find interesting characterizations of the different infe- 
rence classes studied in this paper. An anonymous referee suggested the following 
problems. When we do not require immediate refutation (as in TxtJRef) the 
delay in refuting the text may be arbitrarily large. It would be interesting to 
study any hierarchy that can be formed by “quantifying” the delay. Note that 
if one just considers the number of excess data points needed before refuting, 
then the hierarchy collapses, except for the * (unbounded but finite) case. As
extensions of criteria considered in this paper, one could consider the situation
when a machine approximately identifies [?] a text T in the
cases when it doesn't identify or refute a text.

Abstract. In the setting of learning indexed families, probabilistic
learning under monotonicity constraints is more powerful than deterministic
learning under monotonicity constraints even if the probability is close
to 1, provided the learning machines are restricted to proper or class
preserving hypothesis spaces (cf. [?]). In this paper, we investigate the
relation between probabilistic learning and oracle identification under
monotonicity constraints. In particular, we deal with the question of how
much "additional information" provided by oracles is necessary in order
to compensate for the additional power of probabilistic learning.

If the oracle machines have access to a K-oracle, then they can
compensate for the power of monotonic (conservative) probabilistic machines
completely, provided the probability p is greater than 2/3 (1/2).
Furthermore, we show that for every recursively enumerable oracle A, there
exists a learning problem which is strong-monotonically learnable by an
oracle machine having access to A, but not conservatively or
monotonically learnable with any probability p > 0. A similar result holds for
Peano-complete oracles. However, probabilistic learning under
monotonicity constraints is "rich" enough to encode every recursively enumerable
set in a characteristic learning problem, i.e., for every recursively
enumerable set A, and every p > 2/3, there exists a learning problem $\mathcal{L}_A$
which is monotonically learnable with probability p, and monotonically
learnable with oracle B if and only if A is Turing-reducible to B. The
same result holds for conservative probabilistic learning with p > 1/2,
and strong-monotonic learning with probability p = 2/3. In particular,
it follows that probabilistic learning under monotonicity constraints
cannot be characterized in terms of oracle identification. Moreover, we close
an open problem that appeared in [?] by showing that the probabilistic
hierarchies of class preserving monotonic and conservative probabilistic
learning are dense.

Finally, we show that these probability bounds are strict, i.e., in the
case of monotonic probabilistic learning with probability p = 2/3,
conservative probabilistic learning with probability p = 1/2, and
strong-monotonic probabilistic learning with probability p = 1/2, K is not
sufficient to compensate for the power of probabilistic learning under
monotonicity constraints.



M. Richter et al. (Eds.): ALT’98, LNAI 1501, pp. 306 ff., 1998.
Springer-Verlag Berlin Heidelberg 1998



Comparing the Power of Probabilistic Learning... 






1 Introduction 

Many human learning processes are inductive, i.e., the learner tries to generate 
a solution of a problem, a concept, or a grammar for a language on the basis of 
incomplete or ambiguous information. In order to understand the special quality 
of inductive learning, it turned out to be useful to investigate abstract learning 
models which try to reflect the human ability to learn natural languages. 

A well studied approach in this field is the theory of formal language learning
first introduced by Gold [?]. The general situation investigated in language
identification in the limit can be described as follows. An inductive inference machine
is an algorithmic device that is fed more and more information about a language 
to be inferred. This information can consist of positive and negative examples or 
only positive ones. In this paper we consider the case where the learner is fed all 
strings belonging to the language to be inferred but no other strings, i.e., learning 
from text. When fed a text for a language L, the inductive inference machine 
has to produce hypotheses about L. The hypotheses the learner produces have 
to be members of an admissible set of hypotheses; every such admissible set is 
called hypothesis space. The hypothesis space may be a set of grammars or a set 
of decision procedures for the languages to be learned. Finally, the sequence of 
hypotheses has to converge to a hypothesis correctly describing the language L 
to be learned. If the learner converges for every positive presentation for L to a 
correct description of L, then it is said to identify the language in the limit from 
text. A learner identifies a collection of languages in the limit from text if and 
only if it identifies each member of this collection in the limit from text. 
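
As a small illustration of identification in the limit from text, the sketch below runs a learner on growing initial segments of a text and records its hypotheses. The family $L_j = \{a^n \mid n \le j\}$, the function names and the search bound are assumptions chosen for the example; they are not taken from the paper. For this particular family, guessing the least index consistent with the data seen so far converges to a correct index; this strategy does not work for every family when only positive data is available.

```python
from typing import List


def member(j: int, s: str) -> bool:
    """Hypothetical indexed family: L_j = {a^n | 0 <= n <= j}."""
    return set(s) <= {"a"} and len(s) <= j


def learner(segment: List[str], bound: int = 1000) -> int:
    """Least index j < bound whose language contains every string seen so far."""
    for j in range(bound):
        if all(member(j, s) for s in segment):
            return j
    raise ValueError("no consistent index below the search bound")


def hypotheses(text_prefix: List[str]) -> List[int]:
    """Hypotheses produced on the initial segments of length 1, 2, ..."""
    return [learner(text_prefix[: x + 1]) for x in range(len(text_prefix))]


# A prefix of a text for L_3; the hypotheses stabilise on index 3.
print(hypotheses(["a", "", "aaa", "aa", "aaa"]))
```

Once the longest string of the target language has appeared in the text, the learner never changes its mind again, which is exactly the convergence requirement described above.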

With respect to potential applications, we do not consider arbitrary
collections of recursive languages but restrict ourselves to enumerable families of
recursive languages with uniformly decidable membership, i.e., indexed families of
uniformly recursive languages (cf. [?], [?], [?] and the references therein).

As mentioned above, we require the learners to produce grammars for the
languages to be learned. However, we do not allow every set of grammars as
hypothesis space but only enumerable families of grammars with uniformly
decidable membership (cf., for example, [?]). Let $\mathcal{L} = L_0, L_1, \ldots$ be an enumerable
family of target languages. Obviously, $\mathcal{L}$ itself may be used as hypothesis space.
This leads to the notion of proper learning, i.e., a learner identifies $\mathcal{L}$ properly
if it learns $\mathcal{L}$ with respect to $\mathcal{L}$ itself. Since the requirement to learn properly in
general leads to a decrease of the learning power, we additionally consider class
preserving probabilistic learning, i.e., $\mathcal{L}$ has to be inferred with respect to some
hypothesis space having the same range as $\mathcal{L}$. Since it may be appropriate to
allow the learner to "construct" new hypotheses or to "amalgamate" hypotheses
already guessed to new hypotheses during the learning process until a correct
description of the language to be learned is found, we also consider the case of
class comprising learning. Thereby, $\mathcal{L}$ is identifiable with respect to a class
comprising hypothesis space if and only if there are an inductive inference machine
M and a hypothesis space $\mathcal{G}$ which has a range comprising $\mathrm{range}(\mathcal{L})$ such that
M learns $\mathcal{L}$ and only chooses hypotheses from $\mathcal{G}$. For more information about
the impact of the hypothesis space on the learning power of inductive or
probabilistic inference machines, we refer the reader, for example, to [?], [?] or
[?].

When observing human inference processes, we notice that people often ac- 
cept that their learning processes fail with a certain probability in order to gain 
learning power. Moreover, they enhance their learning capabilities by using ex- 
ternal sources such as databases or teachers. Finally, people use various learning 
strategies to “improve” their hypotheses, i.e., they reject a hypothesis only if they 
are convinced that the new hypothesis is “better” than the previous conjecture. 
Obviously, the learning model described above does not reflect these human abi- 
lities. Thus, it seems only natural to define and investigate learning models which 
try to formalize the special characteristics of human inference processes. For an
overview on the various modifications and refinements of the learning model of
Gold investigated in the last decades, see [?], [?] or [?].

In this paper, we deal with probabilistic learning models and formal models
of learning with additional information. In both cases, we require the learning
machines to fulfil monotonicity constraints as learning strategies.

Generalization strategies belong to the most important learning heuristics
that are used to guarantee the improvement of the hypotheses during the
learning process. Thereby, a learning algorithm generalizes on a presentation for a
language L provided it starts by hypothesizing a grammar for a language
"smaller" than the language L to be learned, and "refines" this hypothesis gradually
until a correct hypothesis for L is found. Jantke [?] defined the strongest
notion of generalization, namely strong-monotonicity. Thereby, the learner, when
successively fed a text for the language to be inferred, has to produce a chain
of hypotheses such that $L_i \subseteq L_j$ in case j is guessed later than i. Since
strong-monotonicity is a very restrictive constraint on the behavior of an inductive
inference machine (cf. [?]), there are several weaker formalizations of the
generalization principle. One of them is due to Wiehagen [?], namely monotonicity.
Informally, the learner, when successively fed a text for the language L to be
inferred, learns monotonically if it produces a chain of hypotheses such that for
any two hypotheses, the hypothesis produced later is at least as good as the
earlier one with respect to L. More precisely, we require that
$L_i \cap L \subseteq L_j \cap L$ if j is conjectured after i. Furthermore, we consider
weak-monotonic learning (cf. [?]). Weak-monotonicity can be described as follows.
If the learner conjectures j after i and the set of strings seen by the learner when
j is guessed is a subset of $L_i$, then $L_i \subseteq L_j$. For more information about
monotonic learning of recursive or recursively enumerable languages, we refer the
reader to [?], [?], [?], [?], and [?]. Notice that in the setting of indexed families,
weak-monotonicity is equivalent to conservative learning as defined in [?].
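
The three constraints can be checked mechanically on a finite chain of hypotheses. The sketch below is only an illustration under simplifying assumptions of ours: languages are modelled as finite Python sets, `chain[i]` is the language of the i-th hypothesis, and `seen[i]` is the set of strings the learner had seen when it produced that hypothesis. The function names are hypothetical.

```python
from itertools import combinations


def is_strong_monotonic(chain):
    """L_i is a subset of L_j whenever j is guessed later than i."""
    return all(chain[i] <= chain[j]
               for i, j in combinations(range(len(chain)), 2))


def is_monotonic(chain, target):
    """Later guesses are at least as good as earlier ones with respect to target."""
    return all((chain[i] & target) <= (chain[j] & target)
               for i, j in combinations(range(len(chain)), 2))


def is_weak_monotonic(chain, seen):
    """If the data seen when j is guessed still fits L_i, then L_i is a subset of L_j."""
    return all(chain[i] <= chain[j]
               for i, j in combinations(range(len(chain)), 2)
               if seen[j] <= chain[i])


target = {1, 2, 3}
chain = [{1, 4}, {1, 2}, {1, 2, 3}]   # first guess contains the wrong element 4
seen = [{1}, {1, 2}, {1, 2, 3}]

print(is_strong_monotonic(chain))      # the chain drops 4, so it is not strong-monotonic
print(is_monotonic(chain, target))     # restricted to the target it only grows
print(is_weak_monotonic(chain, seen))  # every change was forced by inconsistent data
```

The example chain shows how the notions separate: discarding the incorrect element 4 violates strong-monotonicity, but it is harmless for monotonicity (only the part inside the target counts) and for weak-monotonicity (the first guess had become inconsistent with the data).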

Probabilistic inference of recursive functions was introduced by Freivalds [?],
and further investigated, for example, by Pitt [?], and Wiehagen et al. [?], [?].
In many cases (cf., e.g., [?] and [?]), the probabilistic learning models
investigated induce probabilistic hierarchies with a "gap". In particular, each collection of
recursive languages identifiable from text with probability p > 2/3 is
deterministically identifiable (cf. [?]). Within the setting of probabilistic inference under
monotonicity constraints, the picture completely changes provided the machines
are restricted to proper or class preserving hypothesis spaces (cf. [?]). It turned
out that the learning capabilities of probabilistic inference devices working under
monotonicity constraints are strictly larger than the learning capabilities of their
deterministic counterparts even if the probability has to be close to 1. Moreover,
the learning power is strictly decreasing when the probability increases.

In order to describe how much learning power is gained, we investigate how
much additional information is necessary for a deterministic learner to achieve
at least the learning power of its probabilistic counterpart. There are several
formalizations of learning with additional information, for example learning by
teaching (cf., e.g., [?], [?]), or oracle identification (cf. [?], [?]). In the
setting of oracle identification, the learning machines are allowed to ask questions
of the form "$x \in A$" to an oracle $A \subseteq \mathbb{N}$. Thereby, the information given by the
oracle is independent from the problem to be learnt. For example, the learner
may ask questions to the "halting problem" $K$. In this paper, we restrict ourselves
to oracles which are either recursively enumerable or Peano-complete.

2 Preliminaries 

We denote the natural numbers by $\mathbb{N} = \{0, 1, 2, \ldots\}$. Let $M_0, M_1, \ldots$ be a
standard list of all Turing machines, and let $\varphi_0, \varphi_1, \ldots$ be the resulting acceptable
programming system, i.e., $\varphi_i$ denotes the partial recursive function computed
by $M_i$. Let $\Phi_0, \Phi_1, \ldots$ be any associated complexity measure (cf. [?]). Without
loss of generality we may assume that $\Phi_k(x) \geq 1$ for all $k, x \in \mathbb{N}$. Furthermore,
let $k, x \in \mathbb{N}$. If $\varphi_k(x)$ is defined, we say that $\varphi_k(x)$ converges and write
$\varphi_k(x)\downarrow$; otherwise $\varphi_k(x)$ diverges and we write $\varphi_k(x)\uparrow$.

Let $A, B \subseteq \mathbb{N}$. For the complement of A in $\mathbb{N}$, we write $\overline{A}$. A is
Turing-reducible to B ($A \leq_T B$) if and only if the characteristic function $\chi_A$ of A
can be computed by a machine which has access to an infinite database which
supplies for each $x \in \mathbb{N}$ whether $x \in B$ or not. Such a database is called an
oracle. In particular, the set $K := \{k \mid \varphi_k(k)\downarrow\}$ is an oracle. By TOT, we denote
the set $\{k \mid \varphi_k$ is total$\}$. The class $\{A \mid A \equiv_T B\}$ is called the Turing degree of
B. An oracle A is said to be Peano-complete if every pair of disjoint recursively
enumerable sets can be separated by an A-recursive function. In the sequel, we
assume familiarity with formal language theory (cf. [?]). For more details about
sets and Turing-reducibility, we refer the reader to Odifreddi or Soare (cf. [?]
or [?]).

Let $\Sigma$ be any fixed finite alphabet of symbols and let $\Sigma^*$ be the free monoid
over $\Sigma$. Any subset $L \subseteq \Sigma^*$ is called a language. Let L be a language, and
let $s = s_0, s_1, \ldots$ be a finite or infinite sequence of strings from $\Sigma^*$. Define
$\mathrm{rng}(s) := \{s_k \mid k \in \mathbb{N}\}$. An infinite sequence $\tau = s_0, s_1, \ldots$ of strings from
$\Sigma^*$ with $\mathrm{rng}(\tau) = L$ is called a text for L. For a text $\tau$ and a number x, let
$\tau_x$ be the initial segment of $\tau$ of length x + 1. Following Angluin [?], Lange,
Zeugmann and others (cf., e.g., [?]), we exclusively deal with the learnability of
indexed families of uniformly recursive languages defined as follows. A sequence
$\mathcal{L} = (L_j)_{j \in \mathbb{N}}$ is said to be an indexed family of uniformly recursive languages
provided $L_j \neq \emptyset$ for all $j \in \mathbb{N}$, and there is a recursive function F such that for
all $j \in \mathbb{N}$ and $s \in \Sigma^*$:



$$F(j, s) = \begin{cases} 1, & \text{if } s \in L_j, \\ 0, & \text{otherwise.} \end{cases}$$



In the following, we refer to indexed families of uniformly recursive languages as 
indexed families for short. 

We now make the learning models considered in this paper precise. Let $\mathcal{L}$ be
an indexed family. An inductive inference machine (abbr. IIM) is an algorithmic
device that takes as its input a text for a language $L \in \mathcal{L}$. When fed a text
for L, it outputs a sequence of grammars. The hypotheses the IIM outputs
have to be members of an admissible set of hypotheses; every such set is called
hypothesis space. In this paper, we do not allow arbitrary sets of hypotheses as a
hypothesis space but only enumerable families of grammars $G_0, G_1, G_2, \ldots$ over
the terminal alphabet $\Sigma$ such that $\mathrm{rng}(\mathcal{L}) \subseteq \{L(G_j) \mid j \in \mathbb{N}\}$, and membership
in $L(G_j)$ is uniformly decidable for all $j \in \mathbb{N}$ and all strings $s \in \Sigma^*$. If an IIM
M outputs a number j, then we interpret this number as the index
of the grammar $G_j$, i.e., M guesses the language $L(G_j)$. For a hypothesis space
$\mathcal{G} = (L(G_j))_{j \in \mathbb{N}}$, we use $\mathrm{rng}(\mathcal{G})$ to denote $\{L(G_j) \mid j \in \mathbb{N}\}$. $\mathcal{G}$ is called class
comprising if $\mathrm{rng}(\mathcal{L}) \subseteq \mathrm{rng}(\mathcal{G})$, and class preserving if $\mathrm{rng}(\mathcal{L}) = \mathrm{rng}(\mathcal{G})$.

If, for any text for L, M outputs a sequence of grammars that converges
to a grammar correctly describing L, then M is said to identify the language
in the limit from text. This learning paradigm is called identification in the
limit and was introduced by Gold [?]. By LIM, we denote the collection of all
indexed families $\mathcal{L}$ that can be identified in the limit with respect to a class
comprising hypothesis space $\mathcal{G}$. For more information about inductive inference
and inductive learning of indexed families, we refer the reader to [?] and [?]
for an overview.

In this paper, we consider a probabilistic modification of this concept, namely
probabilistic inductive inference (cf., e.g., [?], [?], [?]). A probabilistic inductive
inference machine (abbr. PIM) is an algorithmic device equipped with a t-sided
coin. A PIM P takes as its input larger and larger initial segments of a text $\tau$
and it either takes the next input string, or it first outputs a hypothesis, i.e., a
number encoding a certain computer program, and then requests the next input
string. Each time P requests a new input string, it flips the t-sided coin. The
hypotheses produced by P, when fed a text $\tau$, depend on the text seen so far
and on the outcome of the coin flips.

Let P be a PIM equipped with a t-sided coin. A coin-oracle c is an
infinite sequence $c_0, c_1, \ldots$ where $c_i \in \{0, \ldots, t-1\}$. By $c^n$, we denote the initial
segment $c_0, \ldots, c_n$ of c for all $n \in \mathbb{N}$. Let c be a coin-oracle. We denote the
deterministic algorithmic device defined by running P with coin-oracle c by $P^c$.
By $P^{c^x}(\tau_x)$, we denote the last hypothesis P outputs, when fed $\tau_x$, under the
condition that the first x + 1 flips of the t-sided coin were $c^x$. If there is no
such hypothesis, then $P^{c^x}(\tau_x)$ is said to be $\bot$. The sequence
$(P^{c^x}(\tau_x))_{x \in \mathbb{N}}$ is said to be a converging path. We say that
$(P^{c^x}(\tau_x))_{x \in \mathbb{N}}$ converges in the limit to the number j iff either there exists
some $n \in \mathbb{N}$ with $P^{c^x}(\tau_x) = j$ for all $x \geq n$, or $(P^{c^x}(\tau_x))_{x \in \mathbb{N}}$ is finite
and its last member is j. Let $\mathcal{G}$ be a hypothesis space. $(P^{c^x}(\tau_x))_{x \in \mathbb{N}}$ is
said to converge correctly with respect to $\mathcal{G}$ iff $(P^{c^x}(\tau_x))_{x \in \mathbb{N}}$ converges in
the limit to a number j and $L(G_j) = L$. Now let Pr denote the canonical
Borel-measure on the Borel-$\sigma$-algebra on $\{0, \ldots, t-1\}^\infty$. For more details about
PIMs, measurability and infinite computation trees we refer the reader to
Pitt (cf. [22]).
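
The role of the coin-oracle can be illustrated with a toy machine. The sketch below is not a construction from the paper: the machine `pim`, the choice t = 3 and the target index are hypothetical, and the exact fraction is computed only over finite coin prefixes, whereas the measure Pr in the text is defined on infinite coin sequences. Fixing the flips turns the probabilistic machine into a deterministic one, which is the idea behind $P^c$.

```python
from fractions import Fraction
from itertools import product

T_SIDES = 3  # a t-sided coin with t = 3 (illustrative choice)


def pim(segment, flips):
    """Toy PIM: the first coin flip selects one of two deterministic strategies.
    `flips` plays the role of the coin prefix, one flip per input string."""
    longest = max(len(s) for s in segment)
    if flips[0] in (0, 1):
        return longest       # strategy A: index given by the longest string seen
    return longest + 1       # strategy B: overgeneralise by one


def success_fraction(segment, target_index):
    """Exact fraction of coin prefixes on which the last hypothesis is correct."""
    n = len(segment)
    good = sum(1 for flips in product(range(T_SIDES), repeat=n)
               if pim(segment, flips) == target_index)
    return Fraction(good, T_SIDES ** n)


# Two of the three outcomes of the first flip lead to the correct index 3.
print(success_fraction(["a", "aaa", "aa"], 3))
```

In this toy setting the machine succeeds on exactly two thirds of the coin prefixes, which mirrors how a success probability is attached to a text by measuring the set of coin-oracles that lead to correct convergence.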

In the following, we define probabilistic inference under monotonicity
constraints. In general, this notion is defined for inductive inference machines. Due
to the lack of space, we directly give the definitions for probabilistic inductive
inference machines and refer the reader to [?], [?] and [?] for more information
about deterministic learning under monotonicity constraints.

Let c be a coin-oracle, let $\tau$ be a text for a recursive language L, and let
P be a PIM. Then the path $(P^{c^x}(\tau_x))_{x \in \mathbb{N}}$ is said to be strong-monotonic if and
only if for all $x, k \in \mathbb{N}$, $k \geq 1$ with $P^{c^x}(\tau_x) \neq \bot$,
$L(G_{P^{c^x}(\tau_x)}) \subseteq L(G_{P^{c^{x+k}}(\tau_{x+k})})$.
It is called monotonic if
$L(G_{P^{c^x}(\tau_x)}) \cap L \subseteq L(G_{P^{c^{x+k}}(\tau_{x+k})}) \cap L$. Moreover,
it is called weak-monotonic if the following holds: if
$\mathrm{rng}(\tau_{x+k}) \subseteq L(G_{P^{c^x}(\tau_x)})$, then
$L(G_{P^{c^x}(\tau_x)}) \subseteq L(G_{P^{c^{x+k}}(\tau_{x+k})})$. Notice that in the settings of probabilistic
learning and oracle identification of indexed families, weak-monotonicity is
equivalent to conservative learning as defined in [?]. Thereby, a path
$(P^{c^x}(\tau_x))_{x \in \mathbb{N}}$ is said to be conservative if and only if for all
$x, k \in \mathbb{N}$, $k \geq 1$ with $P^{c^x}(\tau_x) \neq \bot$ the following holds: if
$P^{c^x}(\tau_x) \neq P^{c^{x+k}}(\tau_{x+k})$, then
$\mathrm{rng}(\tau_{x+k}) \not\subseteq L(G_{P^{c^x}(\tau_x)})$. In the sequel, we
only deal with conservative learning (cf. [?]).

Let $\mu \in \{\mathrm{SMON}, \mathrm{MON}, \mathrm{COV}\}$. $(P^{c^x}(\tau_x))_{x \in \mathbb{N}}$ is said to
$C\mu$-converge correctly with respect to $\mathcal{G}$ iff $(P^{c^x}(\tau_x))_{x \in \mathbb{N}}$ fulfils the
condition $\mu$, and converges correctly with respect to $\mathcal{G}$. Now we are ready to
define probabilistic learning under monotonicity constraints.

Definition 1. Let $\mathcal{L}$ be an indexed family, let L be a language, let $\mathcal{G}$ be a class
comprising hypothesis space, and let $p \in [0,1]$. Let $\mu \in \{\mathrm{SMON}, \mathrm{MON}, \mathrm{COV}\}$.
Let P be a PIM equipped with a t-sided coin. Set

$S_\tau := \{ c \mid (P^{c^x}(\tau_x))_{x \in \mathbb{N}}$ $C\mu$-converges correctly w.r.t. $\mathcal{G} \}$.

P $C\mu_{\mathrm{prob}}(p)$-identifies L from text with probability p with respect to $\mathcal{G}$ if and only
if $\Pr(S_\tau) \geq p$ for every text $\tau$ for L. P $C\mu_{\mathrm{prob}}(p)$-identifies $\mathcal{L}$ with probability
p with respect to $\mathcal{G}$ iff P $C\mu_{\mathrm{prob}}(p)$-identifies each $L \in \mathrm{rng}(\mathcal{L})$ with probability
p.

Let $\mu \in \{\mathrm{SMON}, \mathrm{MON}, \mathrm{COV}\}$. By $C\mu_{\mathrm{prob}}(p)$, we denote the collection of all
indexed families $\mathcal{L}$ that can be $C\mu_{\mathrm{prob}}(p)$-identified with probability p with respect
to a class comprising hypothesis space $\mathcal{G}$. $\mu_{\mathrm{prob}}(p)$ is the collection of all
indexed families $\mathcal{L}$ that can be $C\mu_{\mathrm{prob}}(p)$-identified with probability p with respect
to a class preserving hypothesis space $\mathcal{G}$. Furthermore, $E\mu_{\mathrm{prob}}(p)$ denotes the
collection of all indexed families that can be learned properly with probability
p. More exactly, $\mathcal{L} \in E\mu_{\mathrm{prob}}(p)$ iff $\mathcal{L}$ is $C\mu_{\mathrm{prob}}(p)$-identifiable with probability
p with respect to $\mathcal{L}$ itself. The corresponding deterministic learning classes are
denoted by $C\mu$, $\mu$ and $E\mu$.

Finally, we have to define oracle identification. Let A be an oracle. An oracle 
inference machine (abbr. OIM) is an inductive inference machine M which has 
access to an oracle A. We denote an OIM M having access to A by M[A]. 

Definition 2. Let $\mathcal{L}$ be an indexed family, let L be a language, and let $\mathcal{G}$ be
a class comprising hypothesis space. Let $\mu \in \{\mathrm{SMON}, \mathrm{MON}, \mathrm{COV}\}$. Let A be
an oracle, and let M[A] be an OIM. Then M[A] $C\mu$-identifies L from text with
respect to $\mathcal{G}$ if, for every text $\tau$ for L, the sequence $(M[A](\tau_x))_{x \in \mathbb{N}}$
$C\mu$-converges correctly with respect to $\mathcal{G}$. M[A] $C\mu$-identifies $\mathcal{L}$ with respect to
$\mathcal{G}$ iff M[A] $C\mu$-identifies each $L \in \mathrm{rng}(\mathcal{L})$.

By $C\mu[A]$, we denote the collection of all indexed families $\mathcal{L}$ that can be
$C\mu$-identified by an OIM M[A] with respect to a class comprising hypothesis space.
$\mu[A]$ and $E\mu[A]$ are defined analogously.

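In the following sections, we often need a special set of recursive languages
which encodes the halting problem (cf., e.g., [?]). Let $k \in \mathbb{N}$. Define
$L_k := \{a^k b^n \mid n \in \mathbb{N}\}$, and

$$L'_k := \begin{cases} L_k, & \text{if } \varphi_k(k)\uparrow, \\ \{a^k b^m \mid m \leq \Phi_k(k)\}, & \text{if } \varphi_k(k)\downarrow. \end{cases}$$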
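
The point of such a family is that membership stays decidable although the family encodes the halting problem. The sketch below illustrates this under explicit assumptions of ours: it reads the definition above as $L_k = \{a^k b^n \mid n \in \mathbb{N}\}$ and $L'_k = \{a^k b^m \mid m \leq \Phi_k(k)\}$ when $\varphi_k(k)$ halts (and $L'_k = L_k$ otherwise), and it replaces the complexity measure by a hypothetical lookup table `HALT_STEPS`. To decide whether $a^k b^m$ belongs to $L'_k$, it suffices to run program k on input k for fewer than m steps.

```python
from typing import Dict

# Hypothetical stand-in for Phi_k(k): HALT_STEPS[k] is the number of steps
# after which program k halts on input k; a missing key means divergence.
HALT_STEPS: Dict[int, int] = {0: 3, 2: 1}


def halts_within(k: int, steps: int) -> bool:
    """Bounded simulation: does program k halt on input k within `steps` steps?"""
    return k in HALT_STEPS and HALT_STEPS[k] <= steps


def in_L_prime(k: int, s: str) -> bool:
    """Decide whether s belongs to L'_k (under the assumptions stated above)."""
    a_count = len(s) - len(s.lstrip("a"))
    rest = s[a_count:]
    if a_count != k or set(rest) - {"b"}:
        return False          # s is not of the form a^k b^m
    m = len(rest)
    # a^k b^m is excluded only if the program has halted in fewer than m steps
    return not halts_within(k, m - 1)


print(in_L_prime(0, "bbb"), in_L_prime(0, "bbbb"), in_L_prime(1, "abbbbbbb"))
```

Each membership query needs only a bounded simulation, so the family is uniformly recursive, while telling a finite $L'_k$ apart from the infinite $L_k$ in the limit amounts to deciding whether $\varphi_k(k)$ halts.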

3 Comparing the Power 

3.1 The Power of Oracle Identification 

In [?] we showed that $\mathrm{CCOV}_{\mathrm{prob}}(p) = \mathrm{CCOV}$ and $\mathrm{CSMON}_{\mathrm{prob}}(p) = \mathrm{CSMON}$
for all p > 1/2. Furthermore, $\mathrm{CMON}_{\mathrm{prob}}(p) = \mathrm{CMON}$ for all p > 2/3.
Consequently, in the case of class comprising learning, the learning capabilities of
K-oracle machines working under monotonicity constraints are larger than the
learning capabilities of their probabilistic counterparts, provided the
probabilistic learners are required to learn with probability p > 1/2 (p > 2/3 in the
monotonic case). The following theorem shows that the same result holds for proper
and class preserving probabilistic learning.

Theorem 1. Let p > 1/2, let $\mu \in \{\mathrm{SMON}, \mathrm{COV}\}$ be a monotonicity constraint
and let $\mathcal{L}$ be an indexed family such that $\mathcal{L}$ is $\mu_{\mathrm{prob}}(p)$-identifiable with probability
p with respect to a class preserving hypothesis space $\mathcal{G}$. Then $\mathcal{L}$ is $\mu$-identifiable
with respect to $\mathcal{G}$ by an oracle machine which has access to K. Moreover, every
indexed family $\mathcal{L}$ which is $\mathrm{MON}_{\mathrm{prob}}(p)$-identifiable with respect to a class
preserving hypothesis space $\mathcal{G}$ with probability p > 2/3 is MON-identifiable with respect
to $\mathcal{G}$ by an oracle machine which has access to K.

Proof. Due to the lack of space, we omit the proof. For details see [?].

Next, we show that for every recursively enumerable oracle A, there exists an
indexed family which is strong-monotonically identifiable by an oracle
machine having access to A, but not conservatively or monotonically identifiable
with any probability p > 0 with respect to any class comprising hypothesis space
$\mathcal{G}$.

Theorem 2. Let A be a recursively enumerable oracle, A not recursive. There
exists an indexed family $\mathcal{L}$ with

a. $\mathcal{L} \in \mathrm{ESMON}[A]$,

b. $\mathcal{L} \notin \mathrm{CCOV}_{\mathrm{prob}}(p)$ for all $p \in (0, 1]$,

c. $\mathcal{L} \notin \mathrm{CMON}_{\mathrm{prob}}(p)$ for all $p \in (0, 1]$.

Before proving Theorem 2 we note a technical result.

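Lemma 1. Let $\mu \in \{\mathrm{SMON}, \mathrm{MON}, \mathrm{COV}\}$. Let $\mathcal{L}$ be an indexed family, let A
be an oracle, and let $\mathcal{L} \in \mathrm{SMON}[A] \setminus C\mu_{\mathrm{prob}}(p)$ for some probability p < 1. Then
there exists an indexed family $\mathcal{L}'$ such that $\mathcal{L}' \in \mathrm{SMON}[A] \setminus C\mu_{\mathrm{prob}}(q)$ for every
$q \in (0, 1]$.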

Proof. Due to the lack of space, we omit the proof. 

Proof (of Theorem 2).

It is sufficient to prove the claim for p > 1/2 (in b.), and p > 2/3 (in c.),
since Lemma 1 yields the result for arbitrary $p \in (0, 1]$.

Let A be recursively enumerable, A not recursive. Let $E_A$ be an algorithm
which enumerates A. By $E_A(n)$, we denote the (n+1)-th element of A generated by
$E_A$. Define an indexed family $\mathcal{L}_A$ as follows. Let
$\langle \cdot, \cdot \rangle : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ be an effective
encoding of $\mathbb{N} \times \mathbb{N}$. Then set $L_{\langle k, j \rangle} := L_k$ iff
$k \notin \{E_A(0), \ldots, E_A(j)\}$. If $k \in A$,
then add all subsets of $\{a^k b^i \mid i \leq E_A^{-1}(k) + 1\}$. A similar construction
for $A = K$ can be found in [?] or [?]. By applying the proof techniques developed therein,
we can easily show that $\mathcal{L}_A := (L_{\langle k, j \rangle})_{k, j \in \mathbb{N}}$ witnesses the desired separation.

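Notice that the indexed family defined in the proof of Theorem 2 is in every
oracle learning class $\lambda\mu[A]$, $\lambda \in \{E, \varepsilon, C\}$, $\mu \in \{\mathrm{SMON}, \mathrm{MON}, \mathrm{COV}\}$, but not in
the probabilistic learning classes $\lambda\mu_{\mathrm{prob}}(p)$, $\lambda \in \{E, \varepsilon, C\}$, $\mu \in \{\mathrm{SMON}, \mathrm{COV}\}$,
p > 1/2, and $\lambda\mathrm{MON}_{\mathrm{prob}}(p)$, $\lambda \in \{E, \varepsilon, C\}$, p > 2/3. In particular,
$C\mu[A] \setminus C\mu \neq \emptyset$ for all $\mu \in \{\mathrm{SMON}, \mathrm{MON}, \mathrm{COV}\}$, and A recursively enumerable.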

Let $\mu \in \{\mathrm{SMON}, \mathrm{MON}, \mathrm{COV}\}$, and let A be a recursively enumerable oracle.
In the following, we show that, for p > 1/2, $p \in \mathbb{Q}$ (p > 2/3 in the
monotonic case), it is possible to separate $E\mu_{\mathrm{prob}}(p)$ and $E\mu[A]$ simultaneously from
$E\mu_{\mathrm{prob}}(q)$, q > p. Thereby, $\mathbb{Q}$ denotes the set of rational numbers. Moreover,
the proof of the following theorem yields an analogous result for Peano-complete
oracles.

Theorem 3. Let A be a recursively enumerable oracle, A not recursive. Let
$c, d \in \mathbb{N}$, $\gcd(c, d) = 1$. Then there exists an indexed family $\mathcal{L} \in \mathrm{ESMON}[A]$
with

$\mathcal{L} \in \mathrm{ESMON}_{\mathrm{prob}}(c/d) \setminus \bigcup_{c/d < q \leq 1} \mathrm{ECOV}_{\mathrm{prob}}(q)$, and
$\mathcal{L} \in \mathrm{EMON}_{\mathrm{prob}}(c/d) \setminus \bigcup_{c/d < q \leq 1} \mathrm{EMON}_{\mathrm{prob}}(q)$.

Proof. Let A be a recursively enumerable oracle, A not recursive, and let $E_A$ be
an algorithm which enumerates A. Let p > 1/2. Let $c, d \in \mathbb{N}$ with $\gcd(c, d) = 1$.
Let $\langle \cdot, \cdot \rangle : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ be an effective encoding of
$\mathbb{N} \times \mathbb{N}$. Let $k, k_1, k_2 \in \mathbb{N}$, $k = \langle k_1, k_2 \rangle$, and let
$j \in \mathbb{N}$, $j \leq c - 1$. Finally, we set
$S_{2c-d} := \{S \mid S \subseteq \{0, \ldots, c-1\}, |S| = 2c - d\}$. Let
$\mathrm{cod}_{2c-d} : S_{2c-d} \to \{0, \ldots, \binom{c}{2c-d} - 1\}$ be an
effective encoding of $S_{2c-d}$. Then define
$\mathrm{mod}_{2c-d} : \mathbb{N} \to \{0, \ldots, \binom{c}{2c-d} - 1\}$
by setting $\mathrm{mod}_{2c-d}(y) := x$ iff $x \in \{0, \ldots, \binom{c}{2c-d} - 1\}$ and
$y \equiv x \bmod \binom{c}{2c-d}$
for all $y \in \mathbb{N}$. In order to define the family witnessing the desired separation,
we define a recursive relation Rk uniformly for fc G IN as follows. Let k,n G IN. 
Rk(n) if and only if there exists an m < n with Ea(ji) = k\ and d>k2{k) = m. 
Let k,j G IN. Define L(^k,j) as follows. Let n G IN. 

If -ii?fc(n), then G L(k,j)- 

If i?fe(n), then of IP G Li^k,j) if and only if j ^ (cod 2 c-d)~^(™o'^L-d(‘/afe 2 (^)))- 

Obviously, $\mathcal{L}_{c/d} := (L_{\langle k,j\rangle})_{k \in \mathbb{N},\, j \le c-1}$ is an indexed family. It immediately
follows that $\mathcal{L}_{c/d} \in ESMON[A]$, and hence in $EMON[A]$ and $ECOV[A]$. Moreover,
$\mathcal{L}_{c/d} \in ESMON_{prob}(c/d)$. By using the proof technique developed in [?] we can
show that $\mathcal{L}_{c/d}$ is not $ECOV_{prob}(p)$-learnable with probability $p > c/d$.

Let $B$ be a Peano-complete oracle. Since the sets $\{j \in \mathbb{N} \mid j \bmod n = i \bmod n\}$,
$i \le n - 1$, are separable for all $n \in \mathbb{N}$ by any Peano-complete oracle, the indexed
families constructed in the proof of Theorem 0 are properly strong-monotonically
identifiable by an oracle machine having access to B. By applying Lemma^ we 
can draw the following corollary from Theorem 0 

Corollary 1. Let $B$ be a Peano-complete oracle. Then
$ESMON[B] \setminus ECOV_{prob}(p) \neq \emptyset$ for every $p \in [0, 1]$, and
$EMON[B] \setminus EMON_{prob}(p) \neq \emptyset$ for every $p \in [0, 1]$.



3.2 Characterizing Recursively Enumerable Oracles 

In this section, we investigate the relation between the probabilistic learning
classes $\mu_{prob}(p)$ and $\mu[A]$ for $A <_T K$. Let $A$ be an oracle, $A <_T K$. For $p > 2/3$,
$SMON_{prob}(p) \subseteq SMON[A]$, since $SMON_{prob}(p) = SMON$ for every $p > 2/3$ (cf.
[?]). However, the probabilistic learning class $SMON_{prob}(2/3)$ is able to encode
every recursively enumerable oracle.

Let $A$ be recursively enumerable, $A$ not recursive, and let $E_A$ be an algorithm
which enumerates $A$. Let $\langle\,,\rangle: \mathbb{N} \times \{0, 1\} \to \mathbb{N}$ be an effective encoding of
$\mathbb{N} \times \{0, 1\}$, and let $k, j \in \mathbb{N}$, $j \le 1$. Set
_ / L'fc U |a'^6^A'(D+i}^ j ^ 

\L/U|a'=6^A'W+2}, if j = l. 

Obviously, $\mathcal{L}_A = (L_{\langle k,j\rangle})_{k,j \in \mathbb{N},\, j \le 1}$ is an indexed family. Let $B$ be an oracle. By
using a known argument from [?], we can show that $\mathcal{L}_A \in SMON[B]$ if and only
if $A \le_T B$. Consequently, $SMON_{prob}(2/3)$ and $SMON[B]$ are not comparable
for every $B <_T K$.

In the case of monotonic (conservative) class preserving learning, we are able
to show an analogous result for every $p > 2/3$ ($p > 1/2$). Thus, the probabilistic
learning classes $MON_{prob}(p)$, $p > 2/3$, and $COV_{prob}(p)$, $p > 1/2$, are “rich”
enough to encode every recursively enumerable set.

Theorem 4. Let $A$ be a recursively enumerable oracle, $A$ not recursive. Let
$n, s \in \mathbb{N}$ such that $s + 1$ is a factor of $n$. Then there exists an indexed family
$\mathcal{L}_{n,s} \in EMON_{prob}(2n/(2n+s))$ such that every OIM $M[B]$ which monotonically
identifies $\mathcal{L}_{n,s}$ with respect to a class preserving hypothesis space may be transformed
into a decision procedure for $A$.

Proof. Let Ej^ be an algorithm enumerating A. Let 0 £ IN with n = z(s + l). Let 
( , ): lNx{0, . . . , (n+ 2 :(®^^)) — 1} — >• IN be an effective encoding of lNx{0, . . . , (n+ 
z(^+i)) - 1}. Let = {j £ lN|r(s+ 1) < j < (r+ l)(s+ 1)}. Let (l?r),<(.+i)_i 
be an effective enumeration of all subsets of Dr with cardinality 2. Let k,j £ IN, 
j < + ■z(* 2 ^)) — 1. If a: ^ then set L(fcj) := Lfe for all j < (n + — 1. 

If X £ T, and j < n — 1, then set L(fej) := L/ U bfc)+(i+i)}. If x £ A, 

and j > n, then let r £ IN, 0 < r < z — 1 with n + < j < (n + 

(r + 1)(*^^))- Set L(fcj) := L/ U bfc)+(™+i) |j 7 j g D/}. It follows that 

(-^(fe,f))fc jg]N j<(ra+z('’+i))_i witnesses the desired separation. 

By using some arguments from [?] we may even show that an oracle machine
$M[B]$ already decides $A$ in case it is claimed to identify $\mathcal{L}_{n,s}$ with a probability
$p > 2n/(2n + s)$.

Theorem 5. Let $A$ be a recursively enumerable oracle, $A$ not recursive. Let
$n, s \in \mathbb{N}$ such that $s + 1$ is a factor of $n$. Then there exists an indexed family
$\mathcal{L}_{n,s} \in EMON_{prob}(2n/(2n+s))$ such that every probabilistic OIM $M[B]$ which
monotonically identifies $\mathcal{L}_{n,s}$ with a probability $p > 2n/(2n+s)$ with respect to a class
preserving hypothesis space may be transformed into a decision procedure for $A$.

In particular, it follows that every probabilistic learning class contains
maximally complicated problems.

Corollary 2. For $p > 2/3$, there exists an indexed family $\mathcal{L}_K \in MON_{prob}(p)$
such that every oracle machine $M[A]$ identifying $\mathcal{L}_K$ can be transformed into
a decision procedure for $K$. In particular, $EMON_{prob}(p) \setminus MON[A] \neq \emptyset$ for all
$A <_T K$.

From Theorem 0 and Corollary 0 it follows that probabilistic monotonic learning
with probability $p > 2/3$ cannot be characterized in terms of oracle identification.

Corollary 3. Let $A <_T K$, and let $p \in [0, 1]$. Then $EMON_{prob}(p)$ and $MON[A]$
are not comparable. The same result holds for $ECOV_{prob}(p)$ and $ECOV[A]$.

Since the set $D = \{m \in [2/3, 1] \mid \exists n, s \in \mathbb{N},\ 1 \le s \le n, \text{ with } m = 2n/(2n + s)\}$
is dense in the interval $[2/3, 1]$, it follows from Theorem 0 that the probabilistic
hierarchy in the case of class preserving monotonic probabilistic learning is dense.
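The density of $D$ can be checked numerically. The sketch below is only an illustration under the reading $1 \le s \le n$ of the definition of $D$: for a fixed $n$ it searches all admissible $s$ and returns the element $2n/(2n+s)$ closest to a target probability. The function name `approx_in_D` and the sample target $7/9$ are assumptions made for this example.

```python
from fractions import Fraction

def approx_in_D(target: Fraction, n: int) -> Fraction:
    """Closest value 2n/(2n+s) with 1 <= s <= n, found by brute force over s."""
    candidates = (Fraction(2 * n, 2 * n + s) for s in range(1, n + 1))
    return min(candidates, key=lambda m: abs(m - target))

target = Fraction(7, 9)
# Approximation error for growing n; it shrinks as n increases.
errors = [abs(approx_in_D(target, n) - target) for n in (1, 10, 100, 1000)]
```

For a fixed $n$ the values $2n/(2n+s)$ run from $2/3$ (at $s = n$) up to $2n/(2n+1)$ (at $s = 1$) in steps that shrink as $n$ grows, which is the reason every point of $[2/3, 1]$ can be approximated.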

Corollary 4. $(MON_{prob}(p))_{p \in [0,1]}$ is dense in the interval $[2/3, 1]$.

For conservative probabilistic learning with probability $p > 1/2$, the analogous
results follow from the results proved in [?].

Remark 1. Let $\mu = COV$ or $\mu = MON$. In the last theorems we defined for
every recursively enumerable oracle $A$ a learning problem which encodes $A$.
This technique can be used to define an indexed family encoding a uniformly
recursively enumerable set of recursively enumerable sets $(A_i)_{i \in \mathbb{N}}$, i.e., there is
an indexed family $\mathcal{L}$ such that for every oracle machine $M[B]$ $\mu$-identifying this
family the following holds: $A_i \le_T B$ for all $i \in \mathbb{N}$.

In [?], Stephan proved that $LIM[A] = COV[K]$ for every low r.e. oracle $A$. Thus,
$LIM[A] = LIM[B]$ for any two low r.e. oracles $A, B$. In the case of
inductive inference under monotonicity constraints, we can conclude that $\mu[A] \neq
\mu[B]$ for all low r.e. oracles $A, B$ with $A \not\equiv_T B$. Furthermore, we can draw the
following corollary.

Corollary 5. Let $\mu \in \{SMON, MON, COV\}$. Let $A, B$ be oracles. If $A$ is r.e.,
then $\mu[A] \subseteq \mu[B]$ if and only if $A \le_T B$.

3.3 Peano-Complete Oracles Revisited

Let $\mu \in \{MON, COV\}$. By modifying the indexed families defined in Theorem
0 we can immediately conclude that in every probabilistic learning class $\mu_{prob}(p)$,
there are learning problems which separate $\mu_{prob}(p)$ from $\mu_{prob}(q)$, $q > p$, and
which are conservatively identifiable by an oracle machine having access to a
Peano-complete oracle.

Theorem 6. Let $n, s \in \mathbb{N}$, $1 \le s \le n$. Then there exists an indexed family
$\mathcal{L}_{n,s} \in ECOV_{prob}(n/(n+s)) \setminus \bigcup_{q > n/(n+s)} COV_{prob}(q)$, and $\mathcal{L}_{n,s} \in ECOV[B]$ for all
oracles $B$, $B$ Peano-complete.

Let $n, s \in \mathbb{N}$, where $s + 1$ is a factor of $n$. Then there exists an indexed family
in $EMON_{prob}(2n/(2n+s)) \setminus \bigcup_{q > 2n/(2n+s)} MON_{prob}(q)$ which is also in $EMON[B]$ for
all oracles $B$, $B$ Peano-complete.

Proof. Let n, s G IN, s < n. Let ( , ) be an effective encoding of IN x IN, and 
let k,j G IN. Let z G IN, and let r G {0, . . . , (") — 1} with j = + r. Let 

enumeration of all subsets of the set {0, . . . ,n — 1} of 

cardinality s. 

If 'Pk{k) 4-, Tk{k) = 0 mod 2 and z < <Pk{k) — I, then set 

L(kj) = {aH™|m < L>k{k)} U {a^b^i>Ak)+{r+i),i) | * g 

if L>k{k) 4, Pk{k) = 0 mod 2, L>k{k) < z < 2d>k{k) — I, and r <n then set 

i(fc.i) = < ^k{k) -z} U {akb^'^Ak)+{r+i),j)j^ 

if d>k{k) i, Pk{k) = 0 mod 2, L>k{k) < z < 2<Pk{k) — I, and r > n, then set 
L(^k,j) = Ek, if d>k{k) I and z > 2T>k{k), or / and <pk{k) = I mod 2, or 

L>k{k) t, then set L(fcj) = Lfc- En = {Li^k,j))k,j£m is an indexed family which 
fulfils the desired conditions. The second part of the theorem can be proved by 
combining the proof of the first part and the proof of Theorem g] 

Remark 2. We conjecture that it is possible to characterize the class of
Peano-complete oracles by a problem $\mathcal{L}_{PA}$, i.e., $\mathcal{L}_{PA}$ is $\mu$-identifiable by an oracle
machine $M[A]$ if and only if $A$ is Peano-complete.

4 Verifying the Bounds 

In this section, we show that $K$ is not sufficient to compensate for the power of
probabilistic learning under monotonicity constraints.

Theorem 7. Let $A$ be an oracle with $K \le_T A$.

There exist an indexed family $\mathcal{L}_{1/2} \in ECOV_{prob}(1/2) \cap ECOV[TOT]$ with

$\mathcal{L}_{1/2} \in ECOV[A]$ if and only if $TOT \le_T A$,

and an indexed family $\mathcal{L}_{2/3} \in EMON_{prob}(2/3) \cap EMON[TOT]$ with

$\mathcal{L}_{2/3} \in EMON[A]$ if and only if $TOT \le_T A$.

Proof. In order to prove the first part of the theorem, we define an indexed
family as follows. Let $\langle\,,\rangle: \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ be an effective encoding of $\mathbb{N} \times \mathbb{N}$.
We define for each $k \in \mathbb{N}$ a chain of languages $(L_{\langle k,j\rangle})_{j \in \mathbb{N}}$ in dependence on
$\Phi_k(j)$, $j \in \mathbb{N}$. Let $\Sigma := \{a, b, d\}$. Define the indexed family
$(L_{\langle k,j\rangle})_{k,j \in \mathbb{N}}$ as follows. Let $k, n \in \mathbb{N}$.

If d>k{0) > n, then G for all j G IN. 

Assume ^fc(O) < n. Let j G IN, j > n. In case ^fc(O) -b H-^fc(l) > j, add to 
L{k,o) bnt not to 1 < m < j. If <Lk{0) -b 1 -b < j, compute the least 

s G IN, s > 1, such that E*=o(‘^fc(*) + l) ^ d Ei=o(^fc(*) + l)+^fe(®+l) > T 

Remark that in this case (pk{i) i for alH G {0, . . . , s}. We define whether a^b^, 
d™ belong to for n,u < j, m € IN. 

— a^b^ € Li^k^Qy 

^ + 1) < n < + 1) + ^kijn) for an m < s, then 

a'^6” G Li^k,u) for all YJiLoi.'^kii) + 1) < w < j- 

— If + 1) ^ n, then a^&” ^ for all 1 < u < j. 

— d g forallX;i=d(^fc(/) + l) <u< Ei=d(^fc(*) + 1) + ^fe(s)- 

Set £i/^ = Obviously, £i/^ is an indexed family, and £i/^ G 

ECOVprob{d/2). Now let M[A] be an oracle machine which conservatively identi- 
fies £i/^. Let fc G IN, and let r = (a^6*)ig]N be a text for £(fe,o) • Then £(fc,o) = Lk 
independently from ipk being total or not. Moreover, no other language equals 
Lk, since in both cases, every language £(fc.j)) j > 0, is finite. Since M[A] iden- 
tifies Lfe, there must be an uq G IN such that M[A](t„j,) = 0. Let £ be the 
least natural number such that X)i=i d^k{i) + 1 > n^. Then either ipk is total or 
min{r G lN|(/?fe(r) f} < £■ Since K. <t A, it follows that TOT <t A. 

In order to prove the second part of the theorem, define (£(fc,j,t,))fc,j,t,e]N,i>G{o,i} 
as follows. Let k,j,v,n G IN, w G {0, 1}. If <?fc(0) > n, then G for 
all m,j<n,vG {0, 1}. Assume <l>k{0) < n. Let j S IN, n < j. In case ^fc(O) + 
2 + > j, add to L(^k,u,o) but not to L(^k,u,i), add 

to L^k,u,i) but not to L^k,u,o)t for 1 < u < ^fc(O). If n > ^fc(O) + 2, we add 
to every L(^k,u,v) ioi 1 < u < j, v G {0, 1}. Add no other elements to any 
language. In case <?/c(0) + 2 + ^fc(l) < j, compute the least s S IN, s > 1 such 
that J2iZo i^k{i) + 2) + ^fc(s) < j and J2Uoi^k{i) + 2) + + 1) > j. 

We define whether d"* belongs to L^k,u) for n,u < j. 

- G L{k,o)- 

- If S™o^(^fe(*) + 2) + ^fe(ru) < n < I]™ o^(^fc(*) + 2) +^fe(ur) for an m < s, 

then 0^=5” G L(^k,u) for all + 2) + <Pk{m) < u < Yn=oi'^k{'i) + 

2 ) + + 1 ). 

- If n = Z)i=d(^fe(*) + 2) + ^fe(s) + 1, then a'=&" G L(fc,„,o), a''^” i L{k, u,o), if 

rr = Ei=d(^fc(*) + 2) + <Pk{s) + 2, then a'^}p ^ and a'=&" G L(fc,„,i) 

for all J2iZoi^k{i) + 2) + - 1) <u< I]i=o (^fe(*) + 2) + ^fc(s). 

In both cases, a^b'^ G Li^kuv) for all u < Yhl=o{'^k{i) + 2) + <l>k{s — 1), 
r;G{0,l}. 

If n > YZi=o{‘^k{T) + 2), then G for all u<j,vG {0, 1}. 

- d g L(^k, u,o), and d Ei=o'^‘^b)+i g L(^k,u,i) 

for all JZtZoi'^kii) + 2) + ^k{s -2) <u< JZtZoi'^k{i) + 2) + ^k{s - 1). In 
case s = 2, we interpret s — 3 as 0. Add not other elements to any language. 

Set = (L(fcj»)fcj>g]N,i;e{o,i}- As in the first part of the proof, we can show 
that witnesses the desired separation. 

In the case of conservative learning, Stephan [?] proved that $LIM[A] \subseteq LIM[K]$,
and $COV[A'] = LIM[A]$ for every oracle $A$. Thus, for every oracle $A$, $COV[A]$ is
contained in $LIM[A]$, which is contained in $LIM[K] = COV[TOT]$. Moreover,
we can draw the following corollary.

Corollary 6. There exists $\mathcal{L} \in EMON[TOT]$ with $\mathcal{L} \in LIM \setminus EMON[A]$
for every oracle $A$ with $TOT \not\le_T A$.

Proof. The indexed family defined in Theorem 0 is not in $EMON[A]$ for any
$K \le_T A <_T TOT$. However, $\mathcal{L}_{2/3} \in LIM$, since $LIM_{prob}(2/3) = LIM$ (cf. [?]).
From Corollary 0 it follows that $\mathcal{L}_K$ is $LIM$-identifiable but not monotonically
identifiable by any OIM $M[B]$ where $K \not\le_T B$. Now we can easily define a “join”
of $\mathcal{L}_{2/3}$ and $\mathcal{L}_K$ which is $LIM$-identifiable, but not identifiable by an oracle
machine having access to any oracle $A$ with $TOT \not\le_T A$.

The indexed families defined in Theorem 0 are not properly strong-monotonically
identifiable with $p = 1/2$. The next theorem shows that there is an indexed family
which is strong-monotonically identifiable with $p = 1/2$ but not conservatively
or monotonically identifiable by any $A$-oracle machine where $TOT \not\le_T A$.

Theorem 8. There exists an indexed family $\mathcal{L} \in ESMON_{prob}(1/2)$ with

$\mathcal{L} \in ECOV[TOT] \cap EMON[TOT]$,

$\mathcal{L} \notin CCOV[A] \cup CMON[A]$ for every oracle $A$ with $TOT \not\le_T A$.

Proof. Due to the lack of space we only give a sketch of the proof. Let $\Sigma :=
\{a, b, d\}$. Let $\langle\,,\rangle: \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ be an effective encoding of $\mathbb{N} \times \mathbb{N}$. We define
$(L_{\langle k,j\rangle})_{k,j \in \mathbb{N}}$ in dependence on $\varphi_k(j)$, $k, j \in \mathbb{N}$. Let $k \in \mathbb{N}$. In Step 0 of the
construction, we add to every language. In the n-th step of the construction, 
compute the least s G IN, s > 1, such that + 1) + ^fe(s) < n and 

J2t=o(^k{i) + 1) + + 1) > n. In case n yf Z)i=o^fe(*)> ^^d every subset 

of {a’^P\i < n} to £. In case n = J2Uo^k{i), add to 

L(fc_o)- Moreover, add |j < s}, to every language j yf 0, 

with L(^k,j) C {a'^b^i=o‘^'‘^^^\y < g}. Then £ := {L^k,j))kj(^-iN fulfils the desired 
conditions. Notice that $\mathcal{L} \notin EMON_{prob}(2/3)$, since $EMON_{prob}(2/3) \subseteq LIM$,
and $LIM = COV[K]$.


Abstract. The present work investigates to which extent semantical knowledge
can support the learning of basic mathematical concepts. The considered
learning criteria are learning characteristic or enumerable indices for languages
from positive data where the learner has to converge either syntactically (Ex)
or semantically (BC). The considered classes are the classes of all monoids of
a given group, all ideals of a given ring or all subspaces of a given vector space.
The following is shown:

(a) Learnability depends much on the amount of semantic knowledge given
at the synthesis of the learner, where this knowledge is represented by
programs for the algebraic operations, codes for prominent elements of the
algebraic structure (like 0 and 1 in fields) and certain parameters (like the
dimension of finite dimensional vector spaces). For several natural examples,
good knowledge of the semantics may make it possible to keep ordinal mind
change bounds, while restricted knowledge may either allow only BC-convergence
or even not permit learnability at all.

(b) A recursive commutative ring is Noetherian iff the class of its ideals
is BC-learnable. Such a BC-learner can be synthesized from programs for
addition and multiplication. In many Noetherian rings, one can Ex-learn
characteristic indices for the ideals with an ordinal bound on the number of mind
changes. But there are also some Noetherian rings where it is impossible to
Ex-learn the ideals or to learn characteristic indices for them.



1 Introduction 

The topic of the present work is to study the learnability of mathematical
structures, in particular the learnability of the classes of all ideals, monoids or similar
subsets within a ring, group or field. A special emphasis is placed on the influence
of semantic knowledge, that is, which kinds of knowledge about the algebraic
structure (for example, programs to compute the algebraic operations or codes
for prominent elements) are necessary to synthesize the learning algorithm.

The modelling of semantic knowledge is one major problem in many applications
of artificial intelligence. An automatic translator searching for the Japanese
translation of the English word “brother” has four choices: “otōto” (my
younger brother), “otōtosan” (your younger brother), “ani” (my older brother)
and “onīsan” (your older brother). To find the correct choice, the translator
might have to hunt through the whole text for hints, for example, that the author
himself is attending a secondary school while his brother is studying at the
university, so that the word “ani” is correct. This semantical knowledge is the main
difficulty for automatic translation; syntactic grammatical rules are easier to deal with.

In mathematics, semantical properties of structures are given by the operations
on a group or ring. The dimension of a vector space or a basis for it may
also serve as such semantical knowledge. The underlying mathematical structures
interact with learnability with respect to three aspects: the existence of a
learning algorithm, the quality of the best possible learning algorithm in terms
of convergence, and the amount of knowledge necessary to synthesize a learner.

All learnable classes considered in this paper satisfy the analogue of Noether's
chain condition as well as the fact that the intersection of two sets is again
a set in the class to be learned. So it is possible to learn every such class
having a uniform decision procedure, in contrast to the general case, where
such classes are sometimes not learnable and where, even if one considers only the
learnable classes, it is impossible to synthesize a learner from a program for the
uniform decision procedure [?]. Also, the class of ideals in a Noetherian ring
has either limited learning quality (BC) or a learner which converges on every
sequence of data, even if the sequence does not belong to a set within the given
class. So the semantical constraints imply that there are no rings of intermediate
learning quality like, for example, the class of all finite sets: this class is
learnable with finitely many syntactical mind changes but not by a learner which
also converges on the illegal data sequences. Furthermore, the general learning
algorithm for Noetherian rings outputs only guesses which enumerate the elements
of the ideal to be learned. Although Baur [?] showed that every ideal of a
Noetherian ring is recursive, there are still Noetherian rings where it is
impossible to learn decision procedures.

Basic Algebraic Definitions. Before defining the algebraic structures, the
reader should note that within the present work, the operations $+$ and $\cdot$ are
always taken to be commutative, that is, $a + b = b + a$ and $a \cdot b = b \cdot a$ for all $a, b$. This
is not essential for many theorems, but it makes proofs and argumentation easier
and also helps the reader to follow the theorems and proofs. Also, the operations
$+$ and $\cdot$ are always associative: $a + (b + c) = (a + b) + c$. Now further restrictions
have to be introduced in order to define groups, rings and fields within this
framework of structures with commutative and associative operations.

A group $(G, +)$ has a neutral element, always denoted by 0, such that $a + 0 = a$
for all $a \in G$. This element is unique. Furthermore, there is for every element
$a$ the element $-a$ such that their sum equals 0. Quite prominent groups are
the group $(\mathbb{Z}, +)$ of integers and $(\mathbb{Q}, +)$ of rationals. A monoid $A$ is a subset of
a group which contains 0 and is closed under $+$: whenever $a, b \in A$ then also
$a + b \in A$. The natural numbers $(\mathbb{N}, +)$ form a monoid where $\mathbb{N} = \{0, 1, 2, \ldots\}$.
A subgroup $(A, +)$ is a set which is closed under $+$ and contains for every $a$ also
the inverse $-a$. For example, $(\{\ldots, -4, -2, 0, 2, 4, \ldots\}, +)$ is a subgroup of the
integers.

In a ring $(R, +, \cdot)$, the substructure $(R, +)$ is a group with a neutral additive
element 0. Furthermore, there is a multiplicative neutral element 1 and the
distributive law has to be satisfied: $a(b + c) = ab + ac$ for all $a, b, c$. An ideal is a
subset $A$ of a ring such that $a + b \in A$ for all $a, b \in A$, $-a \in A$ for all $a \in A$ and
$a \cdot b \in A$ for all $a \in R$ and $b \in A$. Adding the multiplication, the integers $(\mathbb{Z}, +, \cdot)$
and rationals $(\mathbb{Q}, +, \cdot)$ are rings. The ideals and the subgroups coincide in the
case of the integers, so every ideal there has the form $\{\ldots, -2a, -a, 0, a, 2a, \ldots\}$
for some $a$.

A field $(F, +, \cdot)$ is a ring with the additional property that every $a \neq 0$ has a
multiplicative inverse $b$ such that $ab = 1$. So a field is a ring where $(F - \{0\}, \cdot)$ is
a group. The rationals and also the reals are examples of fields, but the integers
are not a field since there is no integer $b$ such that $2 \cdot b = 1$. Note that fields have
only the two trivial ideals: $\{0\}$ and the whole set $F$. But subfields and subrings
may be nontrivial structures.

A vector space $(V, +, \cdot)$ over some field, say $(\mathbb{Q}, +, \cdot)$, is a group $(V, +)$
which in addition has a multiplicative operation. For all elements $a, a' \in V$ and
$b, b' \in \mathbb{Q}$ it holds that $a \cdot (b \cdot b') = (a \cdot b) \cdot b'$, $a \cdot (b + b') = a \cdot b + a \cdot b'$ and
$(a + a') \cdot b = a \cdot b + a' \cdot b$.

Further information on algebraic definitions can be found in textbooks like
those of Cohen [?], Eisenbud [?] and Kaplansky [?].

Recursion Theoretic Notation. A set $A$ can be represented in two ways in
recursion theory: by (a) a grammar or a program which generates every element
in $A$ but which does not give any information on the non-elements of the set, and
(b) a program which computes the characteristic function $x \mapsto A(x)$ of the set,
where $A(x) = 1$ if $x$ is in the set and $A(x) = 0$ otherwise. A program $e$ which generates
the elements of $A$ is called an enumerable index for $A$; a program which computes
the characteristic function of $A$ is called a characteristic index for $A$.
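The difference between the two kinds of index can be made concrete with a toy set. The following sketch is an illustration only: it represents the set of even natural numbers once by a generator (playing the role of an enumerable index) and once by a 0/1-valued function (playing the role of a characteristic index). The names `enumerate_evens` and `char_evens` are hypothetical.

```python
from itertools import count, islice
from typing import Iterator

def enumerate_evens() -> Iterator[int]:
    """Enumerable-index view: lists the members, says nothing about non-members."""
    for n in count():
        yield 2 * n

def char_evens(x: int) -> int:
    """Characteristic-index view: decides membership, returning 1 or 0."""
    return 1 if x % 2 == 0 else 0

first_members = list(islice(enumerate_evens(), 5))
```

From the enumeration alone one can confirm that 10 is a member by waiting until it appears, but one can never confirm that 7 is not; the characteristic function answers both questions after finitely many steps.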

Within this paper, the coding of the mathematical structures is always done
by indexing them with natural numbers. So the additive and multiplicative
operations $+$ and $\cdot$ within the structures are always also operations on the codes in
$\mathbb{N}$. The coding is furthermore 1-1 except in the last chapter which deals with
coding where the equality is enumerable or the domain is not a proper non-recursive
subset of $\mathbb{N}$. Since the operations inside the mathematical objects are given by
programs and not as abstract operations which one builds into the programs
as oracle calls, one might expect that the information obtained is a bit more
than in the model of Blum, Shub and Smale [?]. Nevertheless, the results of the
present work could also be executed by a machine doing these ring operations as
oracle calls; much more information can be exploited from the knowledge of the
semantic structure of the underlying objects than from the syntactic structure
of the given programs. But presenting them as programs has the advantage that
all operations within this work can be dealt with in the uniform framework of
recursive functions, although complexity theoretic aspects with respect to
computation time or space (where oracle calls would have the cost 1) are lost since
the learners have to deal with every, sometimes very inefficient, program for the
algebraic operations.

More information on the theory of recursive and enumerable sets can be
found in the books of Odifreddi [21] and Soare [?].

Learning Theoretic Notation. Learning in the present work follows Gold
style learning of languages in the limit [?]. A language is generally an
enumerable set; within this paper it is mostly a recursive set. The learning procedure
receives a text, which is an arbitrary sequence containing all elements of the set
but no non-element. For each such text $T$ and each finite prefix $\sigma \preceq T$ of the
text, the learner produces an output $M(\sigma)$ which is a guess for a program to
compute the characteristic function of the set to be learned. The most general
notion of convergence is BC (“behaviourally correct”): $M$ learns a language $L$
iff $M$ outputs, for every text $T$ of $L$ and almost every $\sigma \preceq T$,
an index $M(\sigma)$ of $L$. A whole class $S$ is learnable under a given criterion like BC
iff there is a recursive learner which learns every $L \in S$ from every text under
this criterion.

The programs output by a BC-learner may all be different and, since it is
impossible to check the equality of programs, one cannot see whether momentary
changes of the hypothesis really give some improved program or just rephrase
the previous program. So one favours a learner which identifies the languages
syntactically. The underlying criterion is called Ex (“explanatory”) and most
natural BC-learnable classes, but not all, are also Ex-learnable. Formally, a
machine $M$ Ex-learns a language $L$ iff $M$ outputs on every text $T$ of $L$ for almost
all $\sigma \preceq T$ the same program $e$ for $L$. Note that every such algorithm can be
translated into an equivalent one which learns the same languages and in addition
converges on every text to the same program for $L$.

A more restrictive variant is learning with a bounded number of mind changes,
where the learner may output only a constant number of different hypotheses
among which the last one is correct. Freivalds and Smith [?] introduced the
more general notion of bounding mind changes by ordinals: here the learner has
to count down an ordinal at every mind change and when the ordinal reaches 0,
no further mind change is possible. For practical purposes it is often sufficient
to consider ordinals which can be expressed as polynomials in $\omega$ with positive
integer coefficients. For example, the numbers $1$, $3$, $\omega + 2$, $\omega + 5$ and $2\omega + 3$
are such ordinals. The class containing $\emptyset$ and all sets $\{n, n + 1, \ldots\}$
is a quite natural example of a class learnable with the ordinal mind change
bound $\omega$ but not with any constant mind change bound.
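A minimal sketch of a learner for this example class may help; it is an illustration written for this text, not code from the paper. The learner conjectures the empty set until the first datum arrives and afterwards conjectures the tail starting at the smallest number seen so far. The function name `learn_tail` and the encoding of hypotheses (`None` for the empty set, `n` for $\{n, n+1, \ldots\}$) are assumptions.

```python
from typing import Iterable, Optional

def learn_tail(text: Iterable[int]) -> tuple[Optional[int], int]:
    """Return (final hypothesis, number of mind changes) on a finite text prefix.

    Hypothesis None stands for the empty set; hypothesis n stands for
    the set {n, n+1, ...}.
    """
    hypothesis: Optional[int] = None
    mind_changes = 0
    for x in text:
        if hypothesis is None or x < hypothesis:
            hypothesis = x      # a new minimum refutes the current guess
            mind_changes += 1
    return hypothesis, mind_changes
```

The ordinal counter starts at $\omega$. When the first datum $x$ arrives, the counter drops to the finite value $x$, and every later mind change lowers the minimum and hence the counter. No constant bound works, because the first datum can be arbitrarily large.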

Since there is no infinite descending sequence of ordinals, it is clear that
a learner with an ordinal bound on the number of mind changes converges on
every input text, even if the language generated by it is not enumerable and
therefore definitely not learnable. Ambainis, Jain and Sharma [?] showed that
the converse also holds: if a machine $M$ learns a class $S$ of languages and
converges on every text, wherever it comes from, then there is a recursive ordinal $\alpha$
such that $S$ can be learned with an $\alpha$ bound on mind changes.



Learning with Semantic Knowledge. Adleman and Blum [?] showed that
the semantic knowledge of which programs are total and which are not makes it
possible to learn all recursive functions in the limit. In their approach it is even
sufficient to get this information in the limit. Formally this is done by using high
oracles: Adleman and Blum [?] showed that exactly the high oracles make it possible
to learn the class REC of all recursive functions in the limit.

Semantic knowledge is in the present work mainly knowledge about some
mathematical structures linked to the languages to be learned. For example, if ideals
within a ring are learned, the semantic knowledge might consist of programs
which compute the ring operations. These programs may be $i$ for the addition
and $j$ for the multiplication. Of course the learning algorithm depends on $i$ and
$j$: so given $i$ and $j$, one first has to synthesize the learner, which then learns the
ideal using some semantic knowledge also derived from $i$ and $j$. Osherson, Stob
and Weinstein [?] showed that synthesis can be quite difficult: there is no
effective procedure which synthesizes an Ex-learner for the finite class $\{W_e, W_{e'}\}$
from grammars $e$ and $e'$ generating these sets. In the case of learning uniformly
recursive families of languages or functions, synthesis is considerably more powerful
[?]. Note that synthesizing a learner via a recursive function and having
a learner with parameters, which is correct for every fixed legal value of these
parameters, is the same concept: both can be transformed into each other.
This holds in an abstract manner for all parameterized recursion theoretic
procedures and recursion theorists deal with it as “substitution” or an application
of the “$S^m_n$-Theorem” [21, Proposition II.1.7].

In the setting of learning uniformly recursive (or indexed) families of
languages, the (quite restricted) semantics are present in the form of a program [3, 28]. In
the case of learning functions by enumeration, it is quite obvious how to
synthesize a learner from this information. On the other hand, this is not possible
in the case of language learning from positive data. Here one exploits semantic
knowledge about the family which cannot be obtained algorithmically from the
decision procedure. Kobayashi and Yokomori [?] obtained for these families
results parallel to those obtained for BC-learning within the present work.
Since for uniformly recursive families the notions of syntactic and behavioural
convergence coincide, Kobayashi and Yokomori [?] state their results for
Ex-learning. Later, they [?] investigated the learnability of certain classes of regular
languages and to which degree this learnability is preserved under the formation
of subclasses and the application of homomorphisms.

The topic of the present work is to find connections between the ability to
learn or to synthesize learners on one hand and the access to semantic
knowledge about the class to be learned on the other hand. The classes to be learned
are mainly substructures (ideals, subgroups, subrings, subspaces) of prominent
mathematical objects.



2 The Ring of Integers 

Within this section it is investigated to which extent it is possible to learn
subsets of the integers which satisfy certain natural requirements. These subsets
are either ideals within the ring $(\mathbb{Z}, +, \cdot)$ or at least monoids, that is, closed
under $+$. It is shown that learnability depends very much on the extent to which
semantical information is accessible on the present coding of the natural
numbers.

There are direct encodings of the integers into the natural numbers such that
all operations (addition, negation and multiplication) are easily computable and
codes for prominent numbers like 0 and 1 are known to the learner. The next
theorem shows how one can learn in this standard model the classes of all ideals
and monoids and gives optimal bounds on the mind change complexity which
can be achieved. Later, variants are considered where less information is present
and therefore some of the specific semantics of the integers is lost. It is then shown
that either the complexity of the learning process goes up or learning becomes
impossible altogether. This loss of semantics makes it necessary to distinguish between
a number $x$ and the code $a_x$ representing it. Nevertheless, relations and operations
can also be carried out on the codes, so $x + y$ stands just for the $z$ which satisfies

$a_x + a_y = a_z$.
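To make the idea of a direct encoding concrete, here is one common choice, the zigzag coding $0, -1, 1, -2, 2, \ldots \mapsto 0, 1, 2, 3, 4, \ldots$. The paper does not specify its standard model, so this coding and the names `encode`, `decode` and `add_codes` are assumptions made for illustration.

```python
def encode(x: int) -> int:
    """Code of the integer x under the zigzag coding."""
    return 2 * x if x >= 0 else -2 * x - 1

def decode(c: int) -> int:
    """Integer represented by the natural number c."""
    return c // 2 if c % 2 == 0 else -(c + 1) // 2

def add_codes(c1: int, c2: int) -> int:
    """Addition carried out on codes: returns the code of the sum."""
    return encode(decode(c1) + decode(c2))
```

With such a coding the learner knows the codes of 0 and 1 and can compute addition and negation directly on codes. The variants studied later withhold parts of exactly this information.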

Theorem 2.1 For the standard model, the class of the ideals of $(\mathbb{Z}, +, \cdot)$ can be
learned from positive data with mind change complexity $\omega$ and the class of the
monoids of $(\mathbb{Z}, +)$ with mind change complexity $\omega^2$. These bounds are optimal.
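For the ideal part, one natural learner conjectures the ideal generated by the greatest common divisor of the data seen so far. The sketch below is a plausible illustration, not the proof from the paper, which is not reproduced in this text. It works on integers directly rather than on codes, and the name `learn_ideal` is hypothetical.

```python
from math import gcd
from typing import Iterable

def learn_ideal(text: Iterable[int]) -> tuple[int, int]:
    """Return (generator g of the conjectured ideal gZ, number of mind changes)."""
    g = 0                 # initial conjecture: the zero ideal {0}
    mind_changes = 0
    for x in text:
        new_g = gcd(g, x)  # gcd(0, x) == |x|, so the first nonzero datum sets g
        if new_g != g:
            g = new_g
            mind_changes += 1
    return g, mind_changes
```

The first mind change moves the ordinal counter from $\omega$ to the finite value $|x|$ of the first nonzero datum. Every later mind change replaces $g$ by a proper divisor, so the counter can simply follow $g$ downwards.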

The next theorem analyzes under which circumstances one can still recover all
necessary information on $\mathbb{Z}$ in order to learn with optimal mind change bounds.

Theorem 2.2 Using the information below on $(\mathbb{Z}, +, \cdot)$ one can synthesize a
machine which Ex-learns characteristic indices from positive data and satisfies
the ordinal mind change bounds $\omega$ for ideals and $\omega^2$ for monoids:

(a) a program for a 1-1 mapping to the standard model;

(b) a program for the addition, the code for 0 and the code for 1;

(c) a program for the addition and a program for the multiplication.

A uniform decision procedure for a class of sets is a mapping $i, x \mapsto U_i(x)$ such
that the sets $U_i$ cover all sets in the class and such that every $U_i$ is in the given
class. Such a class has the characteristic sample property if for every set $U_i$
there is a finite subset $E_i$ such that $E_i \subseteq U_j \Rightarrow U_i \subseteq U_j$ for all $j$. Kobayashi and
Yokomori showed that classes which have a uniform decision procedure
and the characteristic sample property are Ex-learnable. The proof is effective
in the sense that a program for the mapping $i, x \mapsto U_i(x)$ can be uniformly
translated into a program for the learner.
Fact 2.3 Assume that for a uniformly recursive family $U_i$ there is a family
$E_i$ of finite sets such that, for all $j$, $E_i \subseteq U_j \Rightarrow U_i \subseteq U_j$. Then it is
possible to synthesize an Ex-learner which converges on every text of some $U_j$
to the least $i$ with $U_i = U_j$. Namely, this learner assigns to every input $\sigma$ the
first $i < |\sigma|$ such that $\mathrm{range}(\sigma) \subseteq U_i$ and there are no $j, x < |\sigma|$ with
$\mathrm{range}(\sigma) \subseteq U_j$ and $x \in U_i - U_j$; if no such $i$ is found, the algorithm outputs
a special symbol in place of a hypothesis.
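
The learner described in Fact 2.3 can be sketched directly. The sketch assumes that elements are coded by natural numbers and that the uniform decision procedure is available as a Python function `U(i, x)`; the names `learner` and `multiples` and the example family (where $U_i$ is the set of multiples of $i + 1$) are chosen for illustration only.

```python
def learner(sigma, U):
    """Hypothesis of the learner from Fact 2.3 on the finite input sigma.

    U(i, x) decides whether x belongs to U_i (an assumed interface).
    Returns an index i, or None when no index qualifies yet.
    """
    n = len(sigma)
    data = set(sigma)

    def consistent(i):
        # range(sigma) is a subset of U_i
        return all(U(i, y) for y in data)

    for i in range(n):
        if not consistent(i):
            continue
        # i is rejected if some consistent U_j omits an x < n that lies in U_i
        refuted = any(
            consistent(j) and U(i, x) and not U(j, x)
            for j in range(n)
            for x in range(n)
        )
        if not refuted:
            return i
    return None


def multiples(i, x):
    """Example family: U_i is the set of multiples of i + 1."""
    return x % (i + 1) == 0
```

In the example family the set $E_i = \{i + 1\}$ is a characteristic sample, since $i + 1 \in U_j$ forces $U_i \subseteq U_j$. On data from the multiples of 4 the learner first discards the larger consistent sets $U_0$ and $U_1$ and then settles on index 3.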

A notation of ordinals assigns to every code one fixed ordinal such that their
ordering is computable. An easy way to represent notations of ordinals is to
identify them with enumerable and well-ordered sets $O \subseteq \mathbb{Q}$. Recall that a set
is well-ordered iff there is no infinite descending sequence $q_0 > q_1 > \ldots$ of
elements of this set. Usually notations of ordinals are also equipped with further
operations to detect limit ordinals, successors and so on — but the negative 
result of the next theorem is obtained by diagonalizing against all well-ordered 
enumerable sets of rationals and therefore, these additional structures can just 
be ignored. 

Theorem 2.4 It is possible to synthesize learners for the class of all ideals or
all monoids in (Z, +, ·) from the following data:

(a) a program for a uniform decision procedure for the class;

(b) a program for the addition. 

The obtained machines learn the given class from positive data and satisfy some 
ordinal mind change bound. But there is no fixed notation of ordinals such that 
the synthesized learner can succeed by bounding its mind changes with respect to 
this notation. 

The loss of the concrete ordinal bound is mainly due to the fact that knowing
the addition alone does not make it possible to identify the code for 1. The next
result shows that it is even worse not to know the 0: then learning is impossible
altogether. Let G be a copy of Z equipped with a translation $f : G \times \mathbb{Z} \to G$ which
assigns to every code $a$ and every integer $n$ the “$n$-th neighbour” of $a$. Such an
operation is still close to the addition, but it conceals every indication of which
number is 0. Therefore the learner has not only to learn the monoids but also
every structure which cannot be distinguished from a monoid, for example the set
corresponding to $\{-2, -1, 0, 1, 2, \ldots\}$ in G. This makes it impossible to learn
monoids.

Theorem 2.5 Let $f : G \times \mathbb{Z} \to G$ be a translation on the copy $G = g(\mathbb{Z})$ of
the integers via some unknown bijection $g$, that is, $f(g(x), y) = g(x + y)$ for all
$x, y \in \mathbb{Z}$. Then it is impossible to synthesize any machine which BC-learns the
class of monoids in G from positive data, where the monoids in G are just the
sets of the form $g(A)$ for monoids $A$ in $\mathbb{Z}$.
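
Why a translation hides the position of 0 can be checked on a toy example. If $g$ is a bijection with $f(g(x), y) = g(x + y)$, then every shifted bijection $g_c(x) = g(x + c)$ satisfies the same equation, so $f$ alone cannot tell which code represents 0. The sketch below verifies this on a finite window of integers; the zig-zag coding `g` and all function names are assumptions made for illustration.

```python
def g(x):
    """Zig-zag coding of Z by N: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..."""
    return 2 * x if x >= 0 else -2 * x - 1


def g_inv(a):
    """Inverse of the zig-zag coding."""
    return a // 2 if a % 2 == 0 else -(a + 1) // 2


def f(a, n):
    """Translation on codes: the n-th neighbour of the code a."""
    return g(g_inv(a) + n)


def shifted(c):
    """The coding g_c(x) = g(x + c), which places 0 at a different code."""
    return lambda x: g(x + c)


def is_translation_for(coding, window=range(-20, 21)):
    """Check f(coding(x), y) == coding(x + y) on a finite window."""
    return all(f(coding(x), y) == coding(x + y) for x in window for y in window)
```

Both `g` and `shifted(5)` pass the check although they assign different codes to 0, which is the ambiguity exploited in Theorem 2.5.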



3 Noetherian Rings 

Noether studied rings without infinite ascending chains of ideals. She
characterized these rings as those where all ideals are generated by a finite subset
of their elements. Due to this and further results, these rings were named after
her. The next result gives a further characterization of the recursive Noetherian 
rings among all rings with recursive ring-operations: they are the rings whose 
class of ideals is learnable. 

Theorem 3.1 Let (R, +, ·) be a ring with recursive ring operations and let S be
the class of all ideals in this ring. Then the following statements are equivalent:

(a) (R, +, ·) is Noetherian;

(b) Enumerable indices for S can be BC-learned from positive data;

(c) A machine which BC-learns enumerable indices for S from positive data
can be synthesized from programs for + and ·.

Proof For (a $\Rightarrow$ c), let $M_{i,j}$ assign to every string $\sigma$ the ideal generated by
$\mathrm{range}(\sigma)$ using the additive operation + given by the index $i$ and the
multiplicative operation · given by the index $j$. For every $\sigma$ and every ideal $I$
containing all elements of $\mathrm{range}(\sigma)$ it holds that $M_{i,j}(\sigma)$ outputs an index
enumerating some ideal $J$ such that $\mathrm{range}(\sigma) \subseteq J \subseteq I$. If the learner has seen
enough of some text for $I$, then all elements of a finite set generating $I$ have
already shown up in the text and $J = I$. So $M_{i,j}$ BC-learns enumerable indices for
ideals in (R, +, ·).

For (c $\Rightarrow$ b), observe that (c) requires effective synthesis of a learner while (b)
requires only its existence.

For (b $\Rightarrow$ a), let $M$ be a BC-learner for S. For every ideal $I$, $M$ has a
locking sequence $\sigma$ such that $\mathrm{range}(\sigma) \subseteq I$. Now let $J$ be the ideal generated by
the finite set $\mathrm{range}(\sigma)$; clearly $J \subseteq I$. For every $\tau \in J^*$ it follows that $M(\sigma\tau)$
outputs an index for $I$. Since $M$ also learns $J$ it follows that $I = J$ and so $I$ is
finitely generated. Hence the ring (R, +, ·) is Noetherian. □
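
The hypothesis of the learner in the direction from (a) to (c) is the ideal generated by the data seen so far, computed with nothing but the given programs for + and ·. The following sketch shows this closure computation for a finite commutative ring, where it terminates; in an infinite ring the same loop would be run stage by stage as an enumeration. The function name `generated_ideal` and the choice of Z/12Z in the usage note are illustrative assumptions, and the generator set is assumed to be non-empty.

```python
def generated_ideal(generators, add, mul, ring_elements):
    """Close a non-empty set of generators under + and under multiplication
    by ring elements, using only the supplied operations.

    Assumption: the ring is finite and commutative and `ring_elements`
    lists all of its elements, so the loop reaches a fixed point.
    """
    ideal = set(generators)
    while True:
        bigger = set(ideal)
        for a in ideal:
            bigger.update(add(a, b) for b in ideal)         # sums
            bigger.update(mul(r, a) for r in ring_elements)  # ring multiples
        if bigger == ideal:
            return ideal
        ideal = bigger
```

With `add = lambda a, b: (a + b) % 12` and `mul = lambda a, b: (a * b) % 12`, the generators 8 and 6 yield the ideal of all even residues modulo 12.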

Applications Baur obtained some of his results in a very general setting
where he did not consider concrete ideals generated by finite sets $E$ but only
an abstract hull operation in a countable universe which assigns to every $E$ the
hull $I(E)$ in the sense that $I(I(E)) = I(E)$, $I(E) \subseteq I(E')$ whenever $E \subseteq E'$,
and $I(E) = \bigcup_n I(\{x_1, x_2, \ldots, x_n\})$ if $E = \{x_1, x_2, \ldots\}$ is infinite. Within such a
setting, he defined Noetherian hull operations as those where every set $E$ has
a finite subset $E'$ such that $I(E) = I(E')$. For these structures one can show
that corresponding versions of Theorems 3.1 and 3.3 hold; so they hold in
particular for learning subgroups or monoids within a basic group (G, +).

Considering Angluin’s model of uniformly computable families, Kobayashi
and Yokomori [Theorems 11 and 12] obtained a general result which implies
a parallel theorem for rings whose ideals are uniformly recursive. They used
only the abstract property of the monoids and ideals that they are closed under
infinite unions. Note that in the world of uniformly recursive families there is
no difference between BC-learning and Ex-learning.

Theorem 3.2 Let S be a uniformly recursive family closed under infinite
union. Then S is Ex-learnable from text iff S is Noetherian (in the sense that 
there are no infinite ascending chains of sets in S). 
So there are rings where BC-learning can be improved to Ex-learning, namely
those where S is uniformly recursive. The next result states that whenever there
is an improvement, then this improvement is a major one: given any algorithm
which can either Ex-learn enumerable indices or BC-learn characteristic indices,
one can effectively find a better one, namely an Ex-learner which outputs
characteristic indices and which has some ordinal mind change bound. So Noetherian
rings have either bad or good learning qualities, but nothing in between. The
intuitive reason is that all Noetherian rings are “almost good” for learning and any
ring whose structure is a bit helpful for learning already has a good learning
algorithm. So it takes a quite involved construction to get a Noetherian ring which
has only bad learning quality.

Theorem 3.3 Let S be the class of all ideals of (R, +, ·) with recursive ring
operations + and ·. Then S can be Ex-learned with an ordinal bound on the
number of mind changes if one of the following conditions is satisfied:

(a) Enumerable indices for S are Ex-learnable;

(b) Characteristic indices for S are BC-learnable.

Proof (a): Fulk showed that every Ex-learner for a class of languages can
be modified such that the learner, on every text for every language in the class,
ends up in a locking sequence. Given such a learner $M$, it will be transformed
in two stages into a learner $N$ satisfying the desired requirements.

First, a characteristic index $f(\sigma)$ based on the information $\sigma$ already seen is
computed. $f$ may be faulty on some strings $\sigma$ but is defined such that it is total
and correct on sufficiently long prefixes $\sigma \preceq T$ for texts $T$ of languages $L \in S$.
Note that there is an effective way to generate for every enumerable set $W_e$ the
ideal $I(W_e)$ generated by $W_e$. Now one defines the following program associated
to $\sigma$ by interpreting the behaviour of the recursive learner $M$:

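$$\varphi_{f(\sigma)}(x) = \begin{cases} 1 & \text{if } x \text{ is enumerated into } I(W_{M(\sigma)}); \\ 0 & \text{if there is } \tau \in I(W_{M(\sigma)} \cup \{x\})^* \text{ such that } M(\sigma\tau) \neq M(\sigma). \end{cases}$$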

If both conditions are satisfied, then the output is an arbitrary one; if neither
is satisfied, then the function is undefined at $x$. Given some ideal $L$, some text $T$
for $L$ and some locking sequence $\sigma \preceq T$, $f(\sigma)$ is a total index for $L$: If $x \in L$
then $x$ will be enumerated into $W_{M(\sigma)}$ and thus also into $I(W_{M(\sigma)})$, so the 1-case
above is defined. Also the 0-case does not occur since $I(W_{M(\sigma)} \cup \{x\})$ also equals
$L$ and $\sigma$ is a locking sequence for $L$. If $x \notin L$, then $x$ will not be enumerated
into $I(W_{M(\sigma)})$, which equals $L$ since $\sigma$ is a locking sequence. Furthermore, the
ideal $I(W_{M(\sigma)} \cup \{x\})$ is different from $L$ and since $M$ infers it, there is some
$\tau \in I(W_{M(\sigma)} \cup \{x\})^*$ such that $M$ makes a mind change to $M(\sigma\tau)$. So
$\varphi_{f(\sigma)}(x) = 0$ in this case.

Furthermore, one can transform any text $T$ into a text $g(T)$ for the ideal $I(T)$
generated by $\mathrm{range}(T)$. This transformation only uses the built-in operations
+ and · of the ring and is independent of the learner $M$. One can realize the
transformation such that the elements at the even positions of $g(T)$ are, just
a bit delayed, those of $T$ while those at the odd positions are generated from
some enumeration of all elements contained in the ideal of $T$ and pasted between
the original elements. The procedure to find these elements to be pasted in is
just to check at every stage $s$ which is the first number which on the one hand
has not occurred within the first $s - 1$ elements of $g(T)$ and which on the other
hand is enumerated into the ideal generated by $T$ within $s$ steps. Thus every text
$T$ is translated into some text $g(T)$ for the ideal generated by the elements of $T$.
The translation is effective and the function $g$ is also defined on strings by
assigning to every $\sigma$ the first $2|\sigma|$ elements of $g(T)$, which depend only on $\sigma$.
Now both functions $f$ and $g$ are combined to give the desired learner $N$.
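
The text transformation $g$ can be sketched for the special case of the ring of integers. The sketch is a simplification made for illustration: instead of a general stage-wise enumeration using only + and ·, it enumerates the ideal through the greatest common divisor $d$ of the data, taking $0, d, -d, 2d, -2d, \ldots$ as the elements available at stage $s$. When no fresh element is available, the current datum is repeated as padding. The name `translate` is hypothetical.

```python
from math import gcd


def translate(sigma):
    """Map a string sigma of integers to the first 2*len(sigma) elements of
    a text for the ideal generated by range(sigma).

    Even positions (0-based) carry the original data, odd positions carry
    pasted elements of the generated ideal.
    """
    out = []
    d = 0
    for s, x in enumerate(sigma, start=1):
        out.append(x)      # original element
        d = gcd(d, x)      # generator of the ideal seen so far
        # elements of dZ regarded as enumerated within s steps
        candidates = [0] + [k * d * sign for k in range(1, s + 1) for sign in (1, -1)]
        # first enumerated element not used so far, else padding with x
        fresh = next((c for c in candidates if c not in out), x)
        out.append(fresh)
    return out
```

The output for a longer string always extends the output for its prefix, which is what allows $g$ to be applied to texts as well as to strings.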

$N(\sigma) = f(\tau)$ for the first $\tau \preceq g(\sigma)$ with $M(\tau\eta) = M(\tau)$ for all
$\eta \in \mathrm{range}(g(\sigma))^*$ of length up to $2|\sigma| - |\tau|$.

It is easy to see that $N$ picks up in the limit the value $f(\tau)$ for some locking
sequence $\tau$ of $g(T)$. Furthermore, since $M$ converges on every $g(T)$, $N$ converges
on every text $T$, for whatever set it is, and thus $N$ satisfies some ordinal mind
change bound.

(b): Again one takes a learner which has on every text a locking sequence; this
time $M$ BC-learns characteristic indices and for each text $T$ of some $L \in S$ there
is a $\sigma \preceq T$ such that $M(\sigma\tau)$ is an index of the characteristic function of $L$ for
all $\tau \in L^*$. The function $g$ to translate the texts is the same as in part (a) but
the function $f$ has to be adapted:

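$$\varphi_{f(\sigma)}(x) = \begin{cases} 1 & \text{if } x \in I(\mathrm{range}(\sigma)); \\ 0 & \text{if } \varphi_{M(\sigma\tau)}(x) = 0 \text{ for some } \tau \in I(\mathrm{range}(\sigma))^*. \end{cases}$$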

If both cases are defined, then the program $f(\tau)$ takes just the first one to occur.
Now the new learner $N$ combines $f$ and $g$ such that

$N(\sigma) = f(\tau)$ for the first $\tau \preceq g(\sigma)$ such that there is no $x \in \mathrm{range}(g(\sigma))$
with $\varphi_{f(\tau),|\sigma|}(x){\downarrow} = 0$.

The verification is similar to the previous case, with the main difference that
$f(\sigma)$ is total for every $\sigma$ since the whole behaviour of $M$ on texts beginning with
$g(\sigma)$ on the ideal $I(\mathrm{range}(\sigma))$ is analyzed. Note that by definition, $N$ translates
any text $T$ into the text $g(T)$ on which $M$ has to converge and that thus $N$
converges on every text, so some ordinal mind change bound is kept. □

A ring is Artinian if every descending chain of ideals is finite. Artinian rings are
also Noetherian. Baur [5, Theorem 3.8] showed that the class of all ideals of an
Artinian ring with recursive operations + and · is a uniformly recursive family.

Given a ring (R, +, ·) and a symbol $x$, one can define the ring $(R[x], +, \cdot)$ of
all polynomials in the variable $x$ over (R, +, ·), where + and · are extended to
this new domain by the well-known algorithms to add and multiply polynomials.
By the Hilbert Basis Theorem, the ring of the polynomials in one variable $x$ is
also Noetherian if the given original ring is Noetherian. Therefore, whenever it
is possible to BC-learn enumerable indices for the ideals of some ring (R, +, ·),
then the same can be done for $(R[x_1, x_2, \ldots, x_n], +, \cdot)$. Furthermore, Hermann
showed that for every recursive field (F, +, ·) the class of ideals in the ring
$(F[x_1, x_2, \ldots, x_n], +, \cdot)$ is a uniformly recursive family. So by Theorem 3.3 (c),
the class of ideals of both rings can be Ex-learned.

Corollary 3.4 The class S of all ideals in a given ring can be Ex-learned from
positive data if the ring is Artinian or of the form $(F[x_1, x_2, \ldots, x_n], +, \cdot)$ for
some recursive field (F, +, ·).

While in Theorem 3.3 conditions (a) and (b) are also necessary, this is no
longer the case for condition (c). Baur [5] showed that every ideal in a Noetherian
ring is recursive, but also gives an example of a recursive Noetherian ring whose
ideals are not uniformly recursive in the sense that every uniformly recursive
family containing all ideals also contains some other sets. Therefore the algorithm
from Fact 2.3 does not work here.

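The next two constructions use localization [Section 1-4]. If a set $H \subseteq R$
is closed under multiplication and contains 1 but does not contain 0, then the
localized ring $(R_H, +, \cdot)$ is given by $R_H = \{\frac{m}{n} : n \in H, m \in R\}$ where
$\frac{m}{n} = \frac{m'}{n'}$ iff $k(mn' - m'n) = 0$ for some $k \in H$,
$\frac{m}{n} + \frac{m'}{n'} = \frac{mn' + m'n}{nn'}$ and $\frac{m}{n} \cdot \frac{m'}{n'} = \frac{mm'}{nn'}$.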

It is easy to see that by operating on pairs and taking an enumerable set $H$,
one obtains again a ring with a representation on which the ring operations are
recursive. The definition of the equality gives that the equality is enumerable in
the sense that the set $\{(a, b) \in R_H \times R_H : a = b\}$ is enumerable. But Baur
[Satz 3.4] showed that, for every Noetherian ring with enumerable equality, the
equality is in fact already recursive.
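
The representation by pairs can be tried out on a small ring with zero divisors, where the factor $k$ in the definition of equality really matters. The sketch below assumes the ring Z/6Z and the multiplicatively closed set $H = \{1, 2, 4\}$; both choices and all function names are made for illustration. In the situation of the text $H$ is only enumerable, so the search for $k$ would be an unbounded search, which is why equality is at first only enumerable.

```python
N = 6           # work in Z/6Z (an assumption for this example)
H = [1, 2, 4]   # closed under multiplication mod 6, contains 1 but not 0


def equal(p, q):
    """m/n = m'/n' iff k(mn' - m'n) = 0 for some k in H."""
    (m, n), (m2, n2) = p, q
    return any(k * (m * n2 - m2 * n) % N == 0 for k in H)


def add(p, q):
    """m/n + m'/n' = (mn' + m'n)/(nn')."""
    (m, n), (m2, n2) = p, q
    return ((m * n2 + m2 * n) % N, (n * n2) % N)


def mul(p, q):
    """(m/n) * (m'/n') = (mm')/(nn')."""
    (m, n), (m2, n2) = p, q
    return ((m * m2) % N, (n * n2) % N)
```

Here 3/1 equals 0/1 because 2 · 3 = 0 in Z/6Z, although the two pairs differ; this is the reason why equality has to be defined through the factor $k$.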

Example 3.5 Let $P$ be an enumerable but not recursive set of prime numbers
and let $H$ be the multiplicative closure of $P$. Then the ring $(\mathbb{Z}_H, +, \cdot)$ has a
recursive representation, the class of its ideals is not uniformly recursive in
any of its recursive representations, and an Ex-learner with mind change bound
$\omega$ can be synthesized from programs for + and ·.

While BC-learners outputting enumerable indices can be synthesized from
programs for + and ·, this is no longer possible for Ex-learners or BC-learners
outputting characteristic indices, even in rings of the form $(\mathbb{Q}[x_1, x_2, \ldots, x_n], +, \cdot)$.
Since the constructions in Theorem 3.3 are effective, it is sufficient to show that
it is impossible to synthesize an Ex-learner outputting characteristic indices.

Theorem 3.6 Let $i$ and $j$ be indices for the operations + and · such that
they define a ring either isomorphic to $(\mathbb{Q}[x], +, \cdot)$ or to $(\mathbb{Q}[x, y], +, \cdot)$. Then it
is impossible to synthesize from given $i$ and $j$ a machine which Ex-learns the
characteristic function for some default ideal of this ring from positive data.

The method to construct this example can be adapted to construct a Noetherian
ring such that its ideals cannot be learned from text under the criterion Ex.
Recall that any Ex-learner outputting enumerable indices for the ideals can
be transformed into a learner outputting characteristic indices, so that one can
without loss of generality consider Ex-learners of the latter type.
Theorem 3.7 There is a Noetherian ring such that the class of its ideals is not
Ex-learnable.

No algebraic characterization of the Ex-learnable rings is known and it is quite
probable that there is no nice one. Nevertheless, one can give a
recursion-theoretic characterization: The class S of all ideals of a Noetherian ring is
Ex-learnable iff there is a $K$-recursive set $B$ containing characteristic indices of
exactly the sets in S.

The omitted proof of Theorem 3.7 showed that whenever $M$ is an Ex-learner
for S then $M$ has high Turing degree, that is, $K'$ can be computed in the limit
using $M$ as an oracle. The following theorem shows that this complexity is also
sufficient to establish an Ex-learner for the ideals of all Noetherian rings which
gets programs for + and · as parameters. Stated in the terminology of Stephan
and Terwijn, one has that, for learning the ideals of Noetherian rings, there
exist universal BC-learners in all Turing degrees and universal Ex-learners
exactly in the high Turing degrees.

Theorem 3.8 Exactly the high Turing degrees allow one to compute an Ex-learner
for every Noetherian ring; this Ex-learner can be synthesized from programs for
the ring operations + and ·.

The next theorem looks at the concrete class of all ideals within the Noetherian
ring $(\mathbb{Q}[x_1, x_2, \ldots, x_n], +, \cdot)$. Note that by Hermann’s result the class
of all ideals is already learnable since it is uniformly recursive; so the accent lies
more on the complexity of the learning process. The learner knows the dimension
$n$, programs for + and · as well as codes for the important elements $0, 1, x_1, x_2,
\ldots, x_n$ in advance. It is shown that $\omega^n$ is the optimal ordinal mind change bound
under these circumstances.

Theorem 3.9 The class S of all ideals in $(\mathbb{Q}[x_1, x_2, \ldots, x_n], +, \cdot)$ can be learned
with mind change bound $\omega^n$ from positive data but not with any mind change
bound $\alpha < \omega^n$.



4 The Field of Rational Numbers 

Learning suitable classes of subsets of fields exploits two operations, the addition
+ and the multiplication · in the field. There may be learning situations where
only one operation is given. So the next two theorems deal with the situation
where the semantic knowledge of the learner is a program for one operation and
the other one should be learned from data. The next result shows that the
multiplication can be learned if the addition is known.

Theorem 4.1 It is possible to synthesize a finite learner from an index $e$ of the
addition in the field $(\mathbb{Q}, +, \cdot)$ which learns a program for the multiplication from
a stream of data consisting of exactly the tuples $(x, y, z)$ with $z = x \cdot y$.

The converse direction, learning the addition from the multiplication, is not
possible, in particular because there are too many ways to define a recursive
additive operation compatible with the given multiplication.
Theorem 4.2 Given the standard encoding of the rationals and the
multiplication ·, the standard addition is not the only additive operation which is
compatible with ·. Indeed there are uncountably many additive operations, each of
them defining an isomorphic copy of the field $(\mathbb{Q}, +, \cdot)$. It is impossible to learn
the recursive ones among these additive operations in the limit from the data
provided by all tuples $(a_x, a_y, a_{x+y})$.

So one has for the field of rationals (and also for the ring of integers Z and
the monoid of natural numbers N) that it is easy to learn the multiplication
if the addition is known while the opposite direction is impossible. This result
has a parallel in the real world: pupils learn in school first how to add
natural numbers and then how to multiply them, since they would face many
more difficulties doing it the other way round. The abstract “non-learnability”
does of course not hold in such a strict sense in the real world, but “difficult” is
an appropriate term to describe the situation.

The (omitted) proof of Theorem 4.1 depends highly on the semantic
knowledge about the field of the rationals. Already for the ring $(\mathbb{Q}[x], +, \cdot)$ it is
impossible to learn the multiplication from the addition.

Finite dimensional vector spaces over the rational numbers are quite common
in mathematics, in particular in number theory. The next result deals with the
question of which semantic information is necessary to learn the subspaces of
finite dimensional vector spaces over the rational numbers from positive data.

Theorem 4.3 Let (V, +, ·) be a $k$-dimensional rational vector space and $k$ be
finite. It is possible to synthesize a machine Ex-learning all linear subspaces from
a program for the addition + and the dimension $k$, but it is impossible to do this
from a program for + alone without having any information on the dimension $k$.

One can even show that for some family of vector spaces with dimension at
most 2 it is impossible to synthesize an Ex-learner for the class of the subspaces
of the given vector space from an index of the addition. So if it were possible to
generate learners which satisfy an ordinal bound on the mind changes via some
fixed notation of ordinals, one could effectively produce the union of the learners
$M_{e_0}$ and $M_{e_1}$ in the construction and obtain an Ex-learner for both cases.
However, such a learner does not exist and it is impossible to synthesize a learner
from an index of the addition and the dimension of the space which respects
ordinal mind change bounds with respect to a fixed notation of ordinals.

A typical example of finite dimensional vector spaces over the rational
numbers is the superfield generated by adding a variable $x$ which represents $\sqrt{2}$. Such
finite dimensional superfields are called number fields of finite degree and have
an additional structure, namely the multiplication. The next result shows that
in number fields of finite degree, programs for the addition and multiplication
are sufficient to generate a learner for the class of their vector subspaces.

Theorem 4.4 It is possible to synthesize, from programs for addition and
multiplication, a learner which Ex-learns from positive data the characteristic function
of any subspace of a number field of finite degree.
5 Models Without Decidable Equality

Within the previous sections, there was always a 1-1 coding of the objects
represented. This assumption is not as natural as one might think: in an acceptable
numbering, every function has infinitely many codes (programs) and furthermore,
it is undecidable which programs are equal and which are not. In model theory,
mathematicians study many models in whose representations the equality is
undecidable. The next theorem also provides an example of this type and shows
that the learnability for the two most natural representations is different. In such
a model, a learner $M$ learns a set $L$ iff it converges to enumerable indices of
the set of all codes of elements of $L$.

Theorem 5.1 There is a group (G, +) having two representations:

(a) The addition + is recursive but the equality = is only enumerable and not
recursive;

(b) The equality is recursive but the addition is not.

Now in case (a) there is a BC-learner but no Ex-learner for the class S of all
finitely generated monoids in G; in case (b) S can be Ex-learned.

So this example shows that learnability depends much on the representation.
The further dichotomy that the first representation has only enumerable indices
while the second can give characteristic ones is due more to the model itself than
to learning theory, since none of the non-empty sets in S in the first representation
is recursive. It is not possible to have an equivalent example for Noetherian rings
since whenever the ring operations are recursive and the set $\{(a, a') : a = a'\}$ is
enumerable, then the equality is also recursive (Baur [Satz 3.4]).
Abstract. This paper describes the design of the inductive logic
programming system Lime. Instead of employing a greedy covering approach
to constructing clauses, Lime employs a Bayesian heuristic to evaluate
logic programs as hypotheses.

The notion of a simple clause is introduced. These sets of literals may be
viewed as subparts of clauses that are effectively independent in terms
of variables used. Instead of growing a clause one literal at a time, Lime
efficiently combines simple clauses to construct a set of gainful candidate
clauses. Subsets of these candidate clauses are evaluated via the Bayesian
heuristic to find the final hypothesis.

Details of the algorithms and data structures of Lime are discussed.
Lime’s handling of recursive logic programs is also described.
Experimental results to illustrate how Lime achieves its design goals of
better noise handling, learning from a fixed set of examples (and from only
positive data), and of learning recursive logic programs are provided.
Experimental results comparing Lime with FOIL and PROGOL in the
KRK domain in the presence of noise are presented. It is also shown that
the already good noise handling performance of Lime further improves
when learning recursive definitions in the presence of noise.



1 Introduction 

This paper is a progress report on Lime, an inductive logic programming (ILP)
system that induces logic programs as hypotheses from ground facts.^1 Unlike
many systems (e.g., Quinlan’s FOIL [23], Muggleton and Feng’s GOLEM [19],
and Muggleton’s Progol [21]) that employ a greedy covering approach to
constructing the hypothesis one clause at a time, Lime employs a Bayesian framework
that evaluates logic programs as candidate hypotheses. This framework,
introduced in [14] and incorporated in the Lime system, has been shown to have the
following features:

— better noise handling,

— ability to learn from fixed example size (i.e., there is no implicit assumption
that the distribution of examples received by the learner matches the true
“proportion” of the underlying concept to the instance space),

— capability of learning from only positive data,^2 and

— improved ability to learn predicates with recursive definitions.

^1 This is a preliminary report of work in progress; updated versions of the report on

Empirical evidence was provided for the effectiveness of this framework with 
respect to the above four criteria in [14] . The present paper describes the design 
of the Lime system. Since the Bayesian heuristic employed in Lime requires 
evaluation of entire logic programs as hypotheses, the search space is naturally 
huge. This paper explains how Lime exploits the structure in the hypothesis 
space to tame the combinatorial explosion in the search space. 

The main idea of the design of Lime is that instead of growing a clause one 
literal at a time, it builds candidate clauses from “groups of literals” referred 
to as “simple clauses” . These simple clauses can be very efficiently combined to 
form new clauses in such a way that the coverage of a clause is the intersection 
of the coverage of the simple clauses from which it is formed. Once a list of 
candidate clauses has been constructed, the Bayesian heuristic is used to evaluate 
its subsets as potential hypotheses. 
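
The coverage property can be illustrated with a small sketch. It assumes that every simple clause has been evaluated once against the examples and that its coverage is stored as a set of example identifiers; the clause names, the data and the function `clause_coverage` are invented for illustration and are not taken from Lime.

```python
from functools import reduce


def clause_coverage(simple_clauses, coverage):
    """Coverage of a clause formed from a non-empty list of simple clauses.

    `coverage` maps each simple clause to the set of examples it covers
    (an assumed data layout). The clause covers exactly the examples
    covered by all of its simple clauses.
    """
    return reduce(set.intersection, (coverage[s] for s in simple_clauses))


# Hypothetical coverage table over examples numbered 1 to 5.
coverage = {
    "s1": {1, 2, 3, 5},
    "s2": {2, 3, 4},
    "s3": {3, 5},
}
```

Under this layout no candidate clause has to be tested against the examples again: combining `s1` and `s2` only requires one set intersection.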

Structurally, Lime has four distinct stages: 

— structural decomposition (preprocessing of the background knowledge), 

— construction of simple clauses, 

— construction of clauses, and 

— search for the final hypothesis. 

Within each stage care is taken that no redundant information is passed to the
next stage. A diagrammatic outline of the various phases of Lime is given in
Figure 1.



1.1 Related Work 

We refer the reader for preliminaries and notation about ILP to the book by 
Nienhuys-Cheng and de Wolf [22] or the article by Muggleton and De Raedt 
[18]. Other source books for ILP are Bergadano and Gunetti [3], Lavrac and 
Dzeroski [12], and Muggleton [17]. Below, we briefly touch upon work that is 
related to ours. 

The design of Lime is somewhat reminiscent of the approach of translating an 
ILP problem into a propositional one. Dzeroski, Muggleton, and Russell [10] de- 
scribe the transformation of determinate ILP problems into propositional form. 
The systems LINUS and DINUS by Lavrac and Dzeroski [12] employ attribute 
value learners after transforming restricted versions of the ILP problem. 

Kietz and Lübbe [11] introduced the notion of k-local clauses, which is
somewhat similar to simple clauses. They divided a clause into a determinate and a
nondeterminate part and further subdivided the nondeterminate part into
k-local parts. They were motivated by a desire to find efficient algorithms for
subsumption.

^2 The system can also learn from only negative data, but this capability is diminished
when the concept requires a recursive definition.

Fig. 1. The partitioning of Lime’s stages. (Diagram not reproduced; its inputs are
labelled “Background” and “Examples”.)

As already noted, Lime considers entire logic programs as hypotheses instead
of building the hypothesis one clause at a time. Another system that follows this
approach is TRACY by Bergadano and Gunetti [2]. During the preprocessing
phase of the background knowledge, Lime automatically extracts type and mode
information. Similar issues are addressed by Morik et al. [16] in their system
MOBAL.

1.2 Outline of the Paper 

The outline of the paper is as follows. In Section 2, we introduce the noise model 
and the Bayesian framework employed in Lime. In Section 3, we describe the 
hypothesis language of Lime and discuss the notion of simple clauses in some 
detail. Sections 4-9 are devoted to a detailed discussion of the system design 
of Lime. Preprocessing of the background knowledge is discussed in Section 4. 
Section 5 describes the construction of simple clauses. Later sections describe 
the construction of clauses and the search for the final hypothesis. Finally, in 
Section 10 we report on experiments with Lime. 

2 Noise Model and the Bayesian Heuristic 

In this section we describe our framework for modeling learning from data of 
fixed example size with noise. Within this framework, we derive a Bayesian 
heuristic for the optimal hypothesis. 

Let $X$ denote a countable class of instances. Let $D_X$ be a distribution on the
instance space $X$. Let $\mathcal{C} \subseteq 2^X$ be a countable concept class. Let $D_{\mathcal{C}}$ represent
the distribution on $\mathcal{C}$. Let $\mathcal{H}$ be a hypothesis space and $P$ be the distribution
(prior) over $\mathcal{H}$. The concept represented by a hypothesis $h \in \mathcal{H}$ is referred to as
the extension of $h$ (written: $\mathrm{ext}(h)$). Further, let $\mathcal{C}$ and $\mathcal{H}$ be such that:

— for each $C \in \mathcal{C}$, there is an $h \in \mathcal{H}$ such that $C = \mathrm{ext}(h)$; and

— for each $C \in \mathcal{C}$, $D_{\mathcal{C}}(C) = \sum_{\{h \in \mathcal{H} \mid C = \mathrm{ext}(h)\}} P(h)$.

Let $\theta(C)$ denote the “proportion” of the concept $C$ with respect to the instance
space $X$, that is, $\theta(C) = \sum_{x \in C} D_X(x)$.

Fig. 2. Model of positive example generation

We assume that a concept C is chosen with the distribution Dc- Let e G [0,1] 
be the level of noise. Suppose we want to generate m positive examples and n 
negative examples (the reader should note that in the fixed example model, m 
and n are independent of the concept C). 

Each of the $m$ positive examples is generated as follows. With probability
$\epsilon$, an instance is randomly chosen from $X$ and made a positive example (this
could possibly introduce noise). With probability $1 - \epsilon$, an instance is repeatedly
selected randomly from $X$ until the instance is an element of the concept. This
instance is the positive example generated. Figure 2 illustrates this process. The
generation of negative examples is done similarly.
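The generation of positive examples described above can be sketched in Python. This is only an illustration under stated assumptions, not part of Lime: the function name `generate_positive_examples`, the use of a finite list with a uniform distribution in place of a general $D_X$, and the toy concept are all hypothetical.

```python
import random

def generate_positive_examples(instances, concept, m, eps, rng):
    """Draw m positive examples under the fixed example size noise model.

    instances: finite list standing in for X (D_X is assumed uniform here).
    concept:   set of instances playing the role of the target concept C.
    eps:       noise level in [0, 1].
    """
    if not concept:
        raise ValueError("concept must be non-empty for rejection sampling")
    examples = []
    for _ in range(m):
        if rng.random() < eps:
            # Noise path: any instance of X is made a positive example.
            examples.append(rng.choice(instances))
        else:
            # Clean path: redraw from X until the instance lies in the concept.
            while True:
                x = rng.choice(instances)
                if x in concept:
                    examples.append(x)
                    break
    return examples

rng = random.Random(0)
X = list(range(10))
C = {0, 1, 2}
clean = generate_positive_examples(X, C, 50, 0.0, rng)  # no noise
noisy = generate_positive_examples(X, C, 50, 1.0, rng)  # pure noise
```

Note that the number of examples `m` is an input and does not depend on the size of the concept, which is the defining feature of the fixed example size model.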

A fixed example size framework allows learning to take place from only
positive data (and from only negative data) in addition to the usual combination
of positive and negative data. Additionally, the choice of such a framework can
be motivated as follows. Many learning systems have an implicit expectation
that the distribution of examples received by the learner matches the true
“proportion” of the underlying concept to the instance space. However, in many
situations such an assumption is unjustified. Usually the number of positive and
negative examples is fixed and independent of the concept being learned. As an
example, consider a learner presented with a set of 100 positive and 100
negative examples of cancer patients. It is very unlikely that this set of examples is
representative of the population from which the examples are drawn.

Footnote: The level of noise $\epsilon$ can be made different for the positive and negative
examples, but for simplicity we take it to be the same.

We now derive a Bayesian heuristic for finding the most probable hypothesis
$h$ given the example set $E$. This induction can be formally expressed as follows.

$$h_{\mathrm{induced}} = \max_{h \in \mathcal{H}} P(h \mid E) \tag{1}$$

Using Bayes’ formula, $P(h \mid E)$ can be expressed as follows.

$$P(h \mid E) = \frac{P(h)\,P(E \mid h)}{P(E)} \tag{2}$$

We will apply Occam’s razor in the computation of $P(h)$, the prior probability of
the hypothesis $h$, thereby assigning higher probabilities to simpler hypotheses.
$P(E \mid h)$, the probability of the examples $E$ given that hypothesis $h$ represents the target
concept, can be calculated by taking the product of the conditional probabilities
of the positive and negative example sets. As each positive example is generated
independently, $P(E^+ \mid h)$ may be calculated by taking the product of the
conditional probabilities of each positive example. $P^+(e \mid h)$, the conditional probability
of a positive example $e$ given hypothesis $h$, is computed as follows.

$$P^+(e \mid h) = \begin{cases} \dfrac{D_X(e)(1-\epsilon)}{\theta(\mathrm{ext}(h))} + D_X(e)\,\epsilon, & \text{if } e \in \mathrm{ext}(h); \\[1ex] D_X(e)\,\epsilon, & \text{if } e \notin \mathrm{ext}(h). \end{cases} \tag{3}$$



A few words about the above equation are in order. Given that $h$ represents
the target concept, the only way in which $e \notin \mathrm{ext}(h)$ is if the right hand path in
Figure 2 was chosen. Hence, in this case the conditional probability of $e$ given
$h$ is $D_X(e)\,\epsilon$. On the other hand, if $e \in \mathrm{ext}(h)$ then either the left or right hand
path in Figure 2 could have been chosen. The contribution of the right hand
path to $P^+(e \mid h)$ is then $D_X(e)\,\epsilon$. If the left hand path is taken, then the instance
drawn is guaranteed to be from the target concept; hence $D_X(e)(1 - \epsilon)$ is divided
by $\theta(\mathrm{ext}(h))$, the proportion of the target concept to the instance space. By
similar reasoning we compute $P^-(e \mid h)$, the conditional probability of a negative
example $e$ given hypothesis $h$.

$$P^-(e \mid h) = \begin{cases} \dfrac{D_X(e)(1-\epsilon)}{1-\theta(\mathrm{ext}(h))} + D_X(e)\,\epsilon, & \text{if } e \notin \mathrm{ext}(h); \\[1ex] D_X(e)\,\epsilon, & \text{if } e \in \mathrm{ext}(h). \end{cases} \tag{4}$$
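As a sanity check on the two conditional probabilities, the following Python sketch evaluates them on a toy instance space and confirms that each sums to one over $X$. The function names `p_plus` and `p_minus`, the dictionary representation of $D_X$, and the three-element instance space are assumptions made for this illustration.

```python
def p_plus(e, ext, dx, eps):
    """Probability of drawing e as a positive example when ext is the target."""
    theta = sum(dx[x] for x in ext)  # proportion of the concept
    clean = dx[e] * (1 - eps) / theta if e in ext else 0.0
    return clean + dx[e] * eps

def p_minus(e, ext, dx, eps):
    """Probability of drawing e as a negative example when ext is the target."""
    theta = sum(dx[x] for x in ext)
    clean = dx[e] * (1 - eps) / (1 - theta) if e not in ext else 0.0
    return clean + dx[e] * eps

dx = {"a": 0.5, "b": 0.25, "c": 0.25}  # hypothetical D_X
ext = {"a"}                            # hypothetical ext(h)
eps = 0.2
total_plus = sum(p_plus(e, ext, dx, eps) for e in dx)
total_minus = sum(p_minus(e, ext, dx, eps) for e in dx)
```

Both totals equal one (up to floating point rounding) because the clean path contributes total mass $1 - \epsilon$ and the noise path contributes $\epsilon$.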



Now, $P(E \mid h)$ can be computed as follows.

$$P(E \mid h) = \prod_{e \in E^+} P^+(e \mid h) \prod_{e \in E^-} P^-(e \mid h) \tag{5}$$



We let TP denote the set of true positives $\{ e \in E^+ \mid e \in \mathrm{ext}(h) \}$; TN denote
the set of true negatives $\{ e \in E^- \mid e \notin \mathrm{ext}(h) \}$; FPN denote the set of false
positives and false negatives, $\{ e \in E^+ \mid e \notin \mathrm{ext}(h) \} \cup \{ e \in E^- \mid e \in \mathrm{ext}(h) \}$.

Footnote: All references to example sets are actually references to example multisets. $E$ is the
union of positive ($E^+$) and negative ($E^-$) examples.

Footnote: The notation $\max_{h \in \mathcal{H}} P(h \mid E)$ denotes a hypothesis $h \in \mathcal{H}$ such that
$(\forall h' \in \mathcal{H})[P(h \mid E) \geq P(h' \mid E)]$.

Substituting (3) and (4) into (5) and using TP, TN, and FPN, we get the following.

$$P(E \mid h) = \left(\prod_{e \in E^+ \cup E^-} D_X(e)\right) \left(\frac{1-\epsilon}{\theta(\mathrm{ext}(h))} + \epsilon\right)^{|TP|} \left(\frac{1-\epsilon}{1-\theta(\mathrm{ext}(h))} + \epsilon\right)^{|TN|} \epsilon^{|FPN|} \tag{6}$$



Now substituting (6) into (2) and (2) into (1) and performing additional arithmetic
manipulation, we obtain the final $h_{\mathrm{induced}} = \max_{h \in \mathcal{H}} Q(h)$, where $Q(h)$ is
defined as follows.

$$Q(h) = \lg(P(h)) + |TP| \lg\left(\frac{1-\epsilon}{\theta(\mathrm{ext}(h))} + \epsilon\right) + |TN| \lg\left(\frac{1-\epsilon}{1-\theta(\mathrm{ext}(h))} + \epsilon\right) + |FPN| \lg(\epsilon)$$

Hence, in our inductive framework, a learning system attempts to maximize $Q(h)$
(referred to as the quality of the hypothesis $h$). Lime evaluates logic programs
as candidate hypotheses by computing their $Q$ values. The details of how Lime
computes $P(h)$ and $\theta(\mathrm{ext}(h))$ are provided in Sections 8 and 9, respectively.
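A direct evaluation of the quality measure, $\lg P(h)$ plus the TP, TN and FPN terms obtained from the logarithm of the likelihood, can be sketched as follows. The function name `quality`, the reading of $\lg$ as the base-2 logarithm, and the numbers used for the two candidate hypotheses are assumptions for illustration; how Lime obtains the prior and the proportion is not shown here.

```python
import math

def quality(prior, theta, tp, tn, fpn, eps):
    """Quality of a hypothesis from its prior, proportion and example counts.

    prior: P(h) > 0; theta: proportion of ext(h), with 0 < theta < 1;
    tp, tn, fpn: sizes of TP, TN and FPN; eps: noise level, 0 < eps <= 1.
    """
    return (math.log2(prior)
            + tp * math.log2((1 - eps) / theta + eps)
            + tn * math.log2((1 - eps) / (1 - theta) + eps)
            + fpn * math.log2(eps))

# Two hypothetical hypotheses scored on 100 positive and 100 negative examples.
q_good = quality(prior=2**-20, theta=0.1, tp=95, tn=96, fpn=9, eps=0.1)
q_poor = quality(prior=2**-10, theta=0.5, tp=60, tn=55, fpn=85, eps=0.1)
```

In this toy comparison the more complex hypothesis (smaller prior) still scores higher because it misclassifies far fewer examples and its extension is a smaller proportion of the instance space, so each true positive carries more weight.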

Finally, we would like to note that the treatment of noise in our framework
has some similarities to that of Angluin and Laird. Their noise level parameter
measures the percentage of data with the incorrect sign, that is, elements of the 
concept being mislabeled as negative data and vice versa. In their model 50% 
noise level means the data is truly random, whereas in our model truly random 
data is at noise level of 100%. Thus, in their model it is not useful to consider 
noise levels of greater than 50%. Our current model requires that the noise level 
be provided to the system. Although this may appear to be a weakness, in 
practice, a reasonable estimate suffices, and it can be shown that with increase 
in the example size, the impact of an inaccurate noise estimate diminishes. It 
should be noted that experiments reported in this paper always used a noise 
parameter of 10% in computing Q{h) even if the actual noise in the data was 
considerably higher. 



3 Hypothesis Space and Simple Clauses 

The hypothesis space of Lime is chosen not only to reduce the size of the search
but also to simplify the structure of the search space. The hypothesis space of
Lime consists of definite ordered logic programs whose clauses are:

— function free: this simplifies the structure of each clause without seriously
affecting expressiveness;

— determinate: although this restricts expressiveness, clause handling is
simplified as each variable may only be bound in a unique way; and

— terms in the head of the target clause are required to be distinct variables,
and terms in literals of the body can only be variables (this restriction can be
overcome by introducing predicates that define a constant and the equality
relation in the background knowledge).

We next motivate the notion of simple clauses. 

Each clause consists of a head literal and a list of body literals. In general,
a literal within the body of a clause may use a variable that has been
introduced and bound by a previous literal in the body. Since variables within the
clause overlap, this literal cannot be considered independent. However, the list
of literals in the body may be broken down into lists of literals that are
effectively independent in terms of variables used. We refer to such lists as simple
clauses. This notion is best illustrated with the help of an example. Consider
the predicate teen_age_boy where the background knowledge consists of each
person’s age, each person’s sex, and the greater than relation. Now a clause for
teen_age_boy is:

teen_age_boy(A) ← male(A), age(A, B), B > 12, 20 > B.

The above clause can be “constructed” from the following three simple clauses
by “combining” their bodies.

teen_age_boy(A) ← male(A).
teen_age_boy(A) ← age(A, B), B > 12.
teen_age_boy(A) ← age(A, B), 20 > B.

To see how the above works and to formally define a simple clause, we
introduce the notion of a directed graph associated with an ordered clause. For any
two literals $l_1$ and $l_2$ in a clause, we say that $l_2$ is directly dependent on $l_1$ just in
case there exists a variable in $l_2$ that is bound in $l_1$. Hence, a directed graph may
be associated with a clause by associating each literal to a node in the graph and
by forming a directed edge from the node associated with literal $l_1$ to the node
associated with $l_2$ just in case $l_2$ is directly dependent on $l_1$. A literal $l_2$ is said
to be dependent on literal $l_1$ just in case there is a path in the graph from $l_1$ to $l_2$.

Clearly, graphs associated with determinate clauses are acyclic. A literal in a 
clause is said to be a source literal just in case there are no literals in the clause 
on which it depends. A literal in a clause is said to be a sink literal just in case 
there are no literals in the clause that depend on it. Clearly, the head of a clause 
is always a source literal. 

Footnote: Intuitively, a clause is determinate if each of its literals is determinate; a literal is
determinate if each of its variables that does not occur in previous literals has only one
possible binding given the bindings of its variables that appear in previous literals.
See Dzeroski, Muggleton, and Russell for a more formal definition.

Footnote: Lime will of course use a slightly different representation as it does not allow
constants; the teen_age_boy example has been chosen to illustrate the notion of a simple clause.

Definition 1. A clause is said to be simple just in case it contains at most one
sink literal.

The directed graph for the clause defining teen_age_boy is shown in Figure 3.
Since it has three sink nodes, it is not a simple clause. However, the directed
graph for the clause describing over12, shown in Figure 4, has exactly one sink
node and hence is a simple clause.



teen_age_boy(A) ← male(A), age(A,B), B>12, 20>B.

over12(A) ← age(A,B), B>12.

Fig. 4. Directed graph corresponding to the over twelve clause
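The sink literals of the two example clauses can be computed with a short Python sketch. The triple representation (literal name, variables used, variables bound) and the function names `sink_literals` and `is_simple` are assumptions made for this illustration, and literal names are assumed to be unique within a clause.

```python
def sink_literals(literals):
    """Return the names of the sink literals of an ordered clause.

    literals: list of (name, used_vars, bound_vars) in clause order, head
    first. A literal directly depends on an earlier literal that binds one
    of the variables it uses.
    """
    has_dependent = set()
    for j, (_, used, _) in enumerate(literals):
        for i in range(j):
            name_i, _, bound_i = literals[i]
            if used & bound_i:
                has_dependent.add(name_i)
    # A literal is a sink exactly when nothing depends on it directly.
    return [name for name, _, _ in literals if name not in has_dependent]

def is_simple(literals):
    """A clause is simple if it has at most one sink literal."""
    return len(sink_literals(literals)) <= 1

teen = [("teen_age_boy(A)", set(), {"A"}),
        ("male(A)", {"A"}, set()),
        ("age(A,B)", {"A"}, {"B"}),
        ("B>12", {"B"}, set()),
        ("20>B", {"B"}, set())]
over12 = [("over12(A)", set(), {"A"}),
          ("age(A,B)", {"A"}, {"B"}),
          ("B>12", {"B"}, set())]
```

Here `sink_literals(teen)` yields the three sinks `male(A)`, `B>12` and `20>B`, while `over12` has the single sink `B>12`. Checking direct dependents is enough, since a literal has a dependent exactly when it has a direct one.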



We next discuss some properties of simple clauses which make them suitable
for our needs. Since we are considering determinate clauses, the ordering of the
literals in the body is important; the body of a clause is considered a list, not
a set. Although there is some flexibility in the ordering, the constraint that “if
$l_1$ depends on $l_2$ then $l_1$ is to the right of $l_2$” must be respected. This constraint
yields an equivalence relation over the set of clauses. If $\mathcal{C}$ is the set of all clauses,
then we define $[c]$ to denote the set of all clauses in $\mathcal{C}$ that are equivalent to $c$. As
these equivalence classes are finite, and there is a lexicographic ordering over the
clauses, a unique clause may be used to represent each equivalence class, which we
take to be the clause in $[c]$ that is least in the lexicographic ordering.

Let $b_1$ and $b_2$ be the bodies of two clauses with the same head. Then $b_1 \uplus b_2$
denotes the concatenation of $b_1$ with $b_3$, where $b_3$ is $b_2$ with all the $b_1$ literals
removed. Consistency in variable naming is carefully maintained as follows:

— variables in the heads of the two clauses are the same;

— if there is a literal in $b_1$ and another in $b_2$ with the same predicate symbol
and the same variables that have been bound in previous literals, then the
new variables in these two literals have the same variable names. A simple
way of maintaining this consistency in variable naming is to name variables
by the simple clauses that created them.

We also adopt the notation $\uplus$ for clauses. Let $c_1 = h_1 \leftarrow b_1$ and $c_2 = h_2 \leftarrow b_2$.
Then $c_1 \uplus c_2$ is $h_1 \leftarrow b_1 \uplus b_2$ if $h_1 = h_2$ and undefined otherwise.

Simple clauses have three useful properties. First, any clause may be
constructed from a finite set of simple clauses. Second, the intersection of the coverage of
a set of simple clauses is the coverage of the clause formed by combining the set
of simple clauses. The third property is about the completeness of the method
of constructing simple clauses; i.e., there is an algorithm that enumerates the
complete set of simple clauses for a given hypothesis language.

The next three propositions formalize these properties. Let $\mathcal{C}$ be the set of all
clauses in the hypothesis space and let $\mathcal{S}$ be the set of all simple clauses. Clearly
$\mathcal{S} \subseteq \mathcal{C}$.

Proposition 1. For all $c \in \mathcal{C}$, there exists a finite set of simple clauses $S$ such
that $c \in [\biguplus_{s \in S} s]$.

Proof. Let $c = h \leftarrow b$ and let $g$ be the graph associated with $c$. Then for each
sink literal in $g$ we construct a simple clause by including all the literals that
this sink literal is dependent on. Let $\{h \leftarrow b_1, h \leftarrow b_2, \ldots, h \leftarrow b_n\}$ be the set of
simple clauses thus formed.

We claim that $c \in [h \leftarrow b_1 \uplus b_2 \uplus \cdots \uplus b_n]$. Clearly, $b_1 \uplus b_2 \uplus \cdots \uplus b_n$ will not
contain any literal not in $c$ as each literal in the body of the simple clauses is
from $c$. Also, the combined clause, $b_1 \uplus b_2 \uplus \cdots \uplus b_n$, will not be missing
any literal from $c$ because each literal in $c$ is either a sink literal or has at least
one sink literal dependent on it. In the former case the literal will be found in
the simple clause formed by the sink literal; in the latter case the literal will be
found in the corresponding simple clause. □
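The construction used in the proof, one simple clause per sink literal followed by recombination of the bodies, can be sketched in Python. The representation of body literals as (name, used variables, bound variables) triples, the function names `decompose` and `combine`, and the identification of literals by their printed names (which presumes consistent variable naming) are assumptions for this illustration.

```python
from functools import reduce

def decompose(literals):
    """Split an ordered clause body into simple-clause bodies, one per sink.

    literals: list of (name, used_vars, bound_vars) for the body, in order;
    head variables are treated as already bound.
    """
    n = len(literals)
    # parents[j] = indices of earlier literals that literal j directly depends on
    parents = [[i for i in range(j) if literals[j][1] & literals[i][2]]
               for j in range(n)]
    has_dependent = {i for ps in parents for i in ps}
    bodies = []
    for sink in range(n):
        if sink in has_dependent:
            continue
        keep, stack = set(), [sink]
        while stack:  # collect the sink and every literal it depends on
            j = stack.pop()
            if j not in keep:
                keep.add(j)
                stack.extend(parents[j])
        bodies.append([literals[j][0] for j in sorted(keep)])
    return bodies

def combine(b1, b2):
    """Concatenate b1 with b2 after removing from b2 the literals already in b1."""
    return b1 + [lit for lit in b2 if lit not in b1]

body = [("male(A)", {"A"}, set()),
        ("age(A,B)", {"A"}, {"B"}),
        ("B>12", {"B"}, set()),
        ("20>B", {"B"}, set())]
simple_bodies = decompose(body)
recombined = reduce(combine, simple_bodies, [])
```

For the teen_age_boy body this produces the three simple clause bodies given earlier, and combining them in order recovers the original body.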

We next show that the coverage of a clause can be calculated by taking 
the intersection of the coverage of its simple clauses. This is because in the 
case of determinate clauses, the variable bindings for the simple clauses and 
the corresponding combined clause match. Hence, given the same interpretation 
prescribed by the background knowledge, if all the literals in the combined clause 
are true then all the literals in the simple clauses will also be true, and vice versa. 

Proposition 2. Let $c \in \mathcal{C}$ be given. Let the set of simple clauses $\{s_1, s_2, \ldots, s_n\}$
be such that $c = s_1 \uplus s_2 \uplus \cdots \uplus s_n$. Then the coverage of $c$ is the intersection of
the coverage of each simple clause in $\{s_1, s_2, \ldots, s_n\}$.

The above property yields a very efficient method of calculating the coverage of a
clause: take the conjunction of the coverage bit vectors associated with the
simple clauses that are combined to form the clause.

Proof. Suppose $c \in \mathcal{C}$ covers an instance $e$. When $e$ is resolved with $c$, each
variable in $c$ is uniquely bound (since $c$ is determinate) in such a way that each
literal in the body of $c$ is true in the interpretation implied by the background
knowledge. As each simple clause contains both a sink literal and all the literals
that this sink literal depends on, the binding for each variable in the simple
clause will be the same as the binding in $c$. Hence, each literal in the body of the
simple clause will also be true in the intended interpretation. Thus, each simple
clause also covers the instance $e$.

Suppose an instance $e$ is covered by each simple clause in $\{s_1, s_2, \ldots, s_n\}$.
Then for each $s_i$ there is a unique variable binding $\sigma_i$ (since each $s_i$ is also
determinate) that witnesses coverage of $e$ by $s_i$. Moreover, each literal in the body of
$s_i\sigma_i$ is true in the intended interpretation. Now, the same variables may appear
in different simple clauses. It is easy to argue that when $e$ is resolved with
different simple clauses the binding for a variable appearing in these simple clauses
is the same (if this were not the case then we would have a contradiction to the
assumption that $c$ is determinate). Hence, when the bodies of the simple clauses
are combined the bindings for variables across all the simple clauses will be the
same. Therefore, each literal in $c$ will also be true in the interpretation. Hence,
$c$ covers $e$. □
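The coverage computation licensed by Proposition 2 amounts to a bitwise AND. The sketch below uses Python integers as bit vectors; the helper `coverage_bits`, the four hypothetical people, and the lambda predicates standing in for simple clause coverage are assumptions for illustration only.

```python
def coverage_bits(covers, instances):
    """Pack a coverage predicate into an int used as a bit vector.

    Bit i is set when covers(instances[i]) is true.
    """
    bits = 0
    for i, e in enumerate(instances):
        if covers(e):
            bits |= 1 << i
    return bits

# Hypothetical instances: (name, sex, age)
people = [("ann", "f", 15), ("bob", "m", 15), ("carl", "m", 30), ("dan", "m", 9)]

male = coverage_bits(lambda p: p[1] == "m", people)     # male(A)
over_12 = coverage_bits(lambda p: p[2] > 12, people)    # age(A,B), B > 12
under_20 = coverage_bits(lambda p: 20 > p[2], people)   # age(A,B), 20 > B

# Coverage of the combined clause is the conjunction of the three vectors.
teen_age_boy = male & over_12 & under_20
```

Only bit 1 (bob) survives the conjunction, so the combined clause covers exactly one of the four instances without any literal being re-evaluated.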

Proposition 3. There exists an algorithm that enumerates the complete set of 
simple clauses. 

Proof. We first discuss the idea behind such an algorithm. The graph associated 
with a simple clause contains one sink literal which directly or indirectly depends 
on all other literals in the clause. Now if this sink literal is removed from the 
simple clause, we get a clause that has one or more sink literals. For each sink 
literal in this new clause, a new simple clause may be created by including the 
sink literal and all the literals that the sink literal depends on. Each new clause 
thus formed is simple and is smaller than the original simple clause. Reversing 
this process, it is easy to see that any simple clause may be created by combining 
a set of smaller simple clauses with a new literal I in such a way that the newly 
formed clause has I as the only sink literal. The one exception to this property is 
the simple clause with empty body. A complete algorithm for enumerating the 
simple clauses follows directly from this property, and such an algorithm forms 
the basis of Lime’s simple clause table construction. 

Algorithm 1 Simple Clause Enumeration.

begin
  current_simple_clause := {h ← .}
  output h ← .
  loop do
    N := {}
    foreach S ⊆ current_simple_clause do
      foreach l ∈ possible_literals do
        sc := combine l with S
        if sc is a new simple clause then
          output sc
          N := N ∪ {sc}
        fi
      od
    od
    current_simple_clause := current_simple_clause ∪ N
  od
end



We now show that Algorithm 1 enumerates the complete set of simple clauses.
The proof is by induction: we show that after the $i$th iteration of the loop all
simple clauses with up to $i$ literals in their bodies have been enumerated.

Clearly this is the case for $i = 0$ as the only simple clause with 0 literals in
its body is the clause with empty body.

Now suppose the inductive hypothesis is true for $i = k$. Then all simple
clauses with $k$ or fewer literals in their bodies will be in current_simple_clause.
After the next iteration of the loop all simple clauses with $k + 1$ literals in their
bodies will have been enumerated. This is because any simple clause with $k + 1$
literals can be formed by a set of simple clauses with at most $k$ literals in their
bodies, and a new literal. Hence, the inductive proposition is true for $i = k + 1$.
Also note that the number of new simple clauses in each iteration of the loop is
finite, as the numbers of both possible subsets and new literals are finite. □

The above algorithm is clearly very inefficient in the way it forms simple
clauses: in each iteration it repeatedly considers simple clauses that have already
been formed, and it also must detect and remove clauses that are not simple.
By requiring new simple clauses to contain a variable from the previous level,
repetition in the algorithm is removed; and by maintaining the literal
dependency information, the non-simple clauses may be avoided. These techniques are
employed by Lime.



4 Preprocessing the Background Knowledge 

The first stage in Lime’s inductive process involves the preprocessing of the 
background knowledge. This phase has three goals: 

— automatic extraction of information from the background knowledge to
enable the system to dynamically direct the search,

— removal of any redundancy within the background knowledge, and 

— encoding of the background knowledge so that it may be efficiently indexed 
for the search. 

We briefly discuss these aspects of the preprocessing phase.

4.1 Extracting Type and Mode Information

Each term in a predicate has an implicit type. A clause in which the type
associated with a variable is inconsistent will not form part of a good hypothesis.
Hence, by learning the type information from the examples and the background
knowledge, inconsistent clauses may be skipped in the search. Lime uses a flat
type hierarchy. Integer and floating point types are inferred simply from the
syntax. Other types are induced from the examples and the background knowledge.
This process is thoroughly examined in [13].

Also, since the search space is restricted to determinate clauses, it is useful
to know the mode restrictions for each predicate prior to the search. In the
absence of such information, each time a literal is added to a clause the system
would need to check that unbound variables are uniquely bound. As this check
is essentially the same each time it is conducted, considerable improvement in
performance can be achieved if mode information is available. Lime extracts
mode information from the data, which enables it to skip clauses that are not
determinate. This process is also detailed in [13]. Another system that addresses
these issues is MOBAL [16].
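To make the notion of a mode restriction concrete, the following Python sketch tests whether fixing a chosen set of argument positions determines the remaining arguments of a predicate, given its ground facts. This is only an illustration of the determinacy check itself; it is not the extraction procedure of [13], and the function name `is_determinate_mode` and the sample facts are hypothetical.

```python
from collections import defaultdict

def is_determinate_mode(facts, input_positions):
    """Check that fixing the input arguments determines all other arguments.

    facts: iterable of equal-length tuples (ground facts of one predicate).
    input_positions: indices of the arguments assumed bound before the call.
    """
    outputs_for = defaultdict(set)
    for fact in facts:
        key = tuple(fact[i] for i in input_positions)
        rest = tuple(v for i, v in enumerate(fact) if i not in input_positions)
        outputs_for[key].add(rest)
    # Determinate when every input combination leads to exactly one output.
    return all(len(outs) == 1 for outs in outputs_for.values())

age = [("ann", 15), ("bob", 15), ("carl", 30)]
parent = [("ann", "bob"), ("ann", "carl")]
```

With these facts, `age` is determinate when the person is bound (each person has one age) but not when only the age is bound, since two people share the age 15. Computing such a table once per predicate avoids repeating the uniqueness check during the search.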

4.2 Removing Redundancy 

There are three ways in which redundancy is removed from the background
knowledge. First, if a set of relations is equivalent then only one needs to
be considered in the inductive process. For this purpose two relations are said
to be equivalent if they consist of identical ground facts when
the predicate name and the ordering of the terms in the predicate are ignored.
Second, if there exists symmetry within the terms in a relation then it is only
necessary to consider one ordering of the terms. Consider the add relation, which
is symmetric in the first two terms. If the variables in the first two terms of an
add literal in the body of a clause are flipped, the new clause will be equivalent
with respect to its coverage and size. Hence, only one of the clauses need be
considered. This is illustrated by the two mult clauses shown below. Although
they are different syntactically, they may be considered equivalent, and hence
only one needs to be considered in the search space.

mult(A, B, C) ← inc(D, A), mult(D, B, E), add(E, B, C).

mult(A, B, C) ← inc(D, A), mult(D, B, E), add(B, E, C).

Third, as only determinate logic programs are considered, any background
relations that do not produce a determinate clause are not considered in the search
space.
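The first two redundancy checks can be sketched over sets of ground facts. The functions `canonical_form`, `equivalent` and `symmetric_in` are hypothetical names for this illustration, fact sets are assumed to be non-empty with mutually comparable argument values, and the brute-force search over argument orderings is an assumption that is only reasonable for small arities.

```python
from itertools import permutations

def canonical_form(facts):
    """Smallest sorted fact list over all orderings of the argument positions."""
    facts = list(facts)
    arity = len(facts[0])
    return min(sorted({tuple(f[i] for i in perm) for f in facts})
               for perm in permutations(range(arity)))

def equivalent(facts_a, facts_b):
    """Same ground facts once predicate name and argument order are ignored."""
    if len(facts_a[0]) != len(facts_b[0]):
        return False
    return canonical_form(facts_a) == canonical_form(facts_b)

def symmetric_in(facts, i, j):
    """True if swapping arguments i and j maps the fact set onto itself."""
    fact_set = set(facts)
    for f in fact_set:
        g = list(f)
        g[i], g[j] = g[j], g[i]
        if tuple(g) not in fact_set:
            return False
    return True

add = [(a, b, a + b) for a in range(3) for b in range(3)]
inc = [(0, 1), (1, 2), (2, 3)]
dec = [(1, 0), (2, 1), (3, 2)]
less = [(0, 1), (0, 2), (1, 2)]
```

On this data `inc` and `dec` are equivalent (one is the other with its arguments reversed), so only one of them needs to be kept, and `add` is symmetric in its first two arguments but not in its first and third.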

4.3 Improving Indexing 

Once the background knowledge is preprocessed it is used to evaluate either
a ground query or a query that uniquely binds new variables. Due to the large
number of queries to be performed, this operation must be efficient. Hence, hash
tables are used to store the background knowledge. The time complexity of a
query operation is $O(1)$ with respect to the number of ground facts defining the
background predicate.
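The two kinds of query can be served by a pair of hash tables, as in the following sketch. The class `FactIndex` and its methods are hypothetical and are not Lime's actual data structure; Python's built-in `set` and `dict` stand in for the hash tables.

```python
from collections import defaultdict

class FactIndex:
    """Hash index from (predicate, input arguments) to matching ground facts."""

    def __init__(self):
        self._ground = set()                 # all (predicate, args) pairs
        self._by_inputs = defaultdict(list)  # (predicate, inputs) -> facts

    def add(self, pred, args, input_positions):
        args = tuple(args)
        self._ground.add((pred, args))
        key = (pred, tuple(args[i] for i in input_positions))
        self._by_inputs[key].append(args)

    def holds(self, pred, args):
        """Ground query: a single hash lookup."""
        return (pred, tuple(args)) in self._ground

    def bind(self, pred, inputs):
        """Return the unique matching fact, or None if absent or ambiguous."""
        matches = self._by_inputs.get((pred, tuple(inputs)), [])
        return matches[0] if len(matches) == 1 else None

index = FactIndex()
index.add("age", ("ann", 15), input_positions=[0])
index.add("age", ("bob", 15), input_positions=[0])
```

A call such as `index.bind("age", ("ann",))` returns the whole fact, from which the new variable's binding (here 15) can be read off, and the cost of each lookup does not grow with the number of stored facts.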



5 Simple Clause Table Construction 

After the background knowledge is preprocessed, a table of candidate simple 
clauses is constructed. The simple clause table is the central data structure in 
Lime. Care is taken that there are no repetitions in this table. Also, as the table 
is constructed a record is maintained of both the instances each clause covers and 
the binding of each variable for different instances. This makes the process more 
efficient as the variable bindings need not be recalculated each time a simple 
clause is extended to form a new simple clause. The simple clause table of Lime 
may be viewed as consisting of two tables: 

— Simple Clause Coverage Table: The first part of the table contains
information about the coverage of instances by candidate simple clauses. This part
of the table is stored as a bit vector: 1 indicating that the simple clause
covers the instance and 0 indicating that the simple clause does not cover
the instance. The advantage of this storage scheme is that the coverage
of a clause formed by combination of two simple clauses can be very
efficiently determined by taking the conjunction of the bit-vectors describing
the coverage of the two simple clauses.

— Variable Binding Table: The second part of the table consists of the binding
information for each variable introduced in the simple clauses. Since we are
concerned here with only determinate clauses, these bindings are unique. We
use X to represent the fact that a variable does not have a binding for the
instance.

An example of Lime’s simple clause table is shown in Figure 5. This table
captures a snapshot taken while Lime is in the process of learning the add relation.
The two tables, the Simple Clause Coverage Table and the Variable Binding Table,
are shown side by side. It can be seen that the base clause for the add relation
may be formed by combining the simple clauses 30 and 33. The conjunction
of the bit-vectors for these clauses will give the coverage of the following
definite clause:

add(V0,V1,V2) :- zero(V0), equal(V1,V2).

The above clause, once formed, can be disjoined with the clause consisting
of just one simple clause (no. 500) to form the complete definition of the add
relation. The reader should note that the disjunction of the bit vectors for the simple
clause 500 with the conjunction of the bit vectors for 30 and 33 covers all the
positive instances and none of the negative instances.

Footnote: In addition to 0 and 1, we also use the don’t care state, denoted here as X, to indicate
that a better simple clause exists (hence, the coverage information is not tabulated,
although the simple clause possibly produces a useful new variable, and hence is
kept).

Fig. 5. Tables constructed in the simple clause stage for add

The simple clause table acts as an intermediate stage between the enhanced
background knowledge and the candidate clauses. In many respects this
intermediate stage is redundant, as the system could generate the candidate clauses
directly from the preprocessed background knowledge. However, there are at
least four efficiency reasons for keeping it: removal of redundancy, introduction
of memoization, removal of paths from the search space that are not fruitful, and
consolidating the construction of new variables in a clause at an initial stage.

First, we discuss the reduction of redundancy. Suppose a clause
$c_1 = h \leftarrow l_1, l_2, l_3, l_4$ consists of three simple clauses:

$h \leftarrow l_1, l_2.$
$h \leftarrow l_3.$
$h \leftarrow l_4.$

By constructing the simple clauses first and then forming $c_1$ by combining them,
$c_1$ is considered only once. However, if the system constructed $c_1$ literal by
literal, the same clause $c_1$ may be considered many times, as outlined below.

$h \leftarrow l_1.$   $h \leftarrow l_1, l_2.$   $h \leftarrow l_1, l_2, l_3.$   $h \leftarrow l_1, l_2, l_3, l_4.$
$h \leftarrow l_2.$   $h \leftarrow l_2, l_3.$   $h \leftarrow l_2, l_3, l_1.$   $h \leftarrow l_2, l_3, l_1, l_4.$
$h \leftarrow l_4.$   $h \leftarrow l_4, l_3.$   $h \leftarrow l_4, l_3, l_2.$   $h \leftarrow l_4, l_3, l_2, l_1.$



However, the above redundancy could be eliminated without the intermediate
stage. One way to do this would be to place a syntactic ordering on the literals
and to adhere to this ordering when considering clauses. However, this introduces
its own problem: the syntactic ordering may not be the most gainful path in
constructing a clause, thereby making the use of a gain heuristic in the search
less effective.

Second, the simple clause table introduces memoization in the system. Lime
records the coverage of simple clauses; hence each time the literals of a simple
clause are considered within a clause, the coverage does not need to be
recalculated. It is simply looked up in the simple clause table. This is also the case for
the variable bindings.

Third, the simple clause table provides a mechanism for removing entire 
branches of the search space. Lime only records simple clauses that are not 
dead-ends, thus eliminating the search of clauses that are not potentially useful 
in the final hypothesis. This is best illustrated with an example. Consider two
literals $l_1$ and $l_2$. When these literals are combined in a clause, the clause covers
no positive or negative example. Any clause that contained these literals would
be a dead-end. Hence, by identifying $l_1$ and $l_2$, and not considering them, a whole
series of ‘dead-ends’ is removed.

Footnote: The clause should also cover no examples used in the theta estimation (see Section 9).
Otherwise, this constraint would fail when the system is learning from only negative
examples, as the target hypothesis would in general cover no negative examples and
clearly no positive examples. Hence, the system would consider the target hypothesis
a dead-end.

Fourth, by separating the clause generation into two stages, induction of
simple clauses followed by induction of candidate clauses, all aspects of generating
new variables in a clause are assigned to the first stage. This makes the latter
stage more efficient as it is not concerned with generating new variables and
maintaining their bindings.

5.1 Algorithm for Simple Clause Coverage Table and Variable Binding Table

We now present the algorithm for constructing the two tables. The algorithm
maintains three data structures: a list of simple clauses, a list of bit vectors
representing coverage of the simple clauses, and a table of variable bindings
for the variables used in the simple clauses. These structures are initialized to
contain just the simple clause with an empty body. Then candidate literals are
created by considering each predicate symbol in the preprocessed background
knowledge with all possible variable bindings generated until now and some new
variable, provided certain conditions are satisfied. A syntactic ordering is used
to label variables to avoid considering literals which are equivalent to the ones
already considered. If the coverage of a clause is different from the coverage of
clauses generated until now or if the clause introduces a new variable, then it
is incorporated into the data structure. To ensure that no simple clauses are
repeated, the new literal must contain at least one variable from the previous
cycle through the background relations. Note that some simple clauses produce new
variables that are of use, but their instance coverage is of no value, in which
case the new variables are recorded, but not the bit vector. The variable binding
table not only maintains the variable binding of each variable for each instance,
it also maintains an index of the simple clause the variable was generated in, and
the level at which the variable was created. The task is complete
if no new simple clauses are added in a given iteration, or if one of the tables is
full. The detailed algorithm follows:

Algorithm 2 Simple Clause Table Construction.

Input:
  BG - preprocessed background knowledge
Output:
  C - a list of simple clauses
  BV - a bit vector table that stores coverage information for simple clauses
  VT - a variable binding table for each variable used in a simple clause

begin
  C := h ← {}   /* h is the head of the target predicate definition */
  Add to BV the bit vector for the empty body clause
  Add to VT the variable bindings for the empty body clause
  added := true
  level := 0
  while added do
    added := false
    foreach predicate symbol P in BG do
      foreach possible variable assignment a for P
          with at least one variable from the previous level do
        lit := literal formed by P and a
        clause := simple clause formed by adding lit to simple clauses
                  from C that originate variables from a
        vector := generate bit vector from clause
        binding_vectors := compute the binding vector for
                           each new variable in lit
        if vector not in BV then
          /* This simple clause is not equivalent to a previous one */
          Add clause to C
          Add vector to BV
          Add (binding_vectors, level) to VT
          added := true
        else if binding_vectors may be useful then
          Add clause to C
          Add (binding_vectors, level) to VT
          added := true
        fi
        if any table full then
          break while loop
        fi
      od
    od
    level := level + 1
  od
  output C, BV, VT
end



6 Clause Table Construction 

The simple clauses may now be combined to form a table of candidate clauses.
Clearly, this has to be done efficiently as there are $2^k$ candidate clauses to
consider for any $k$ simple clauses. To this end Lime takes advantage of the following.

— Simple clauses may be combined efficiently by conjunction of bit vectors.

— When combining simple clauses, large branches of the search tree may be
pruned without explicit search.

— The selection heuristic feeds back clear bounds on the required search.

We next shed some light on the crucial aspects of the clause table construction
algorithm, followed by the details of the actual algorithm.

The algorithm is essentially depth-first search in nature. At each node in
the search tree the clause associated with the node is considered in conjunction
with every simple clause. A gain heuristic is employed to direct which portion
of the search space to consider. The resulting clauses are added to the best
candidate clause list. There is a restriction on the number of candidate clauses
maintained at any given time; only the best candidates added to the list are
saved. Also, at each node in the search tree a list of the most gainful clauses
is generated. The size of this list is also restricted and depends on the level
in the search tree: the deeper one goes in the search tree, the smaller the list of
gainful clauses. This is because the heuristic is less accurate high up the search
tree. Overlap is eliminated in the search by placing an ordering on the simple
clauses, and requiring that they be considered in that order. This ensures that
each simple clause is only considered once.

Figure 6 shows a search tree with gain_list_length = 100, which is the default
value. It should be noted how the branching decreases as the tree depth increases.
Suppose the depth of the tree is 5; then, if there were no restriction on the
branching factor, there would be 10101010101 nodes.² However, by reducing the
branching factor as the tree is descended, the number of nodes is reduced to 10101,
thereby making the search feasible without significantly affecting the chance of
finding optimal clauses.
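
The two node counts quoted above can be checked with a short calculation. The following is a minimal sketch in Python (not the language Lime is written in). It assumes that the branching factor at a level is the floor of gain_list_length raised to the power 2^(-level), a reading that reproduces the branching factors 100, 10, 3 and 1 annotated in Fig. 6. The helper count_nodes is hypothetical and exists only for this illustration.

```python
import math


def calc_gain_list_size(level: int, length: int) -> int:
    """Branching factor at a level, assumed to be floor(length ** (2 ** -level))."""
    return max(1, math.floor(length ** (2.0 ** -level)))


def count_nodes(depth: int, length: int, shrink: bool) -> int:
    """Count the root plus all nodes on `depth` levels below it.

    With shrink=False every node has `length` children; with shrink=True the
    branching factor is reduced level by level via calc_gain_list_size.
    """
    total = 1            # the root (empty body clause)
    nodes_at_level = 1
    for level in range(depth):
        branching = calc_gain_list_size(level, length) if shrink else length
        nodes_at_level *= branching
        total += nodes_at_level
    return total


if __name__ == "__main__":
    print(count_nodes(5, 100, shrink=False))  # unrestricted branching
    print(count_nodes(5, 100, shrink=True))   # branching 100, 10, 3, 1, 1
```

Under these assumptions the unrestricted tree has 1 + 100 + 100^2 + ... + 100^5 nodes, while the restricted tree has 1 + 100 + 1000 + 3000 + 3000 + 3000 nodes.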

There are two stopping criteria for the search to bound the depth of each
branch in the search tree. As simple clauses are combined, the new clauses cover
fewer examples. Thus, an obvious stopping criterion is when the clause covers
no examples. The other stopping criterion is when a node is reached such that
the best possible clause that can be derived from descendants of this node is
not good enough to make it onto the candidate clause list.

Finally, a few words on the gain heuristic employed to guide the search for
simple clauses and the Bayesian heuristic employed to determine the most gainful
clauses. Suppose old_clause and new_clause are two clauses with the obvious
meaning. Then the gain heuristic for going from old_clause to new_clause is
calculated as $(n_{old} - n_{new}) \times \lg(p_{new} + 2)$, where $n_{old}$ denotes the size of the
negative coverage of old_clause, $n_{new}$ denotes the size of the negative coverage
of new_clause, and $p_{new}$ is the size of the positive coverage of new_clause. The
calculation of the Bayesian heuristic for the most gainful clause is based on the
idea of the Q-heuristic from Section 2.
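
As an illustration of the gain heuristic, here is a minimal Python sketch. It assumes (this is not stated in the text) that coverage is held in an integer used as a bit vector, with one bit per example, and that two masks, pos_mask and neg_mask, mark which bits belong to positive and negative examples. All names are hypothetical.

```python
import math


def popcount(bits: int) -> int:
    """Number of set bits, i.e. number of covered examples."""
    return bin(bits).count("1")


def gain(old_cover: int, new_cover: int, pos_mask: int, neg_mask: int) -> float:
    """(n_old - n_new) * lg(p_new + 2), computed from coverage bit vectors."""
    n_old = popcount(old_cover & neg_mask)   # negatives covered by old clause
    n_new = popcount(new_cover & neg_mask)   # negatives covered by new clause
    p_new = popcount(new_cover & pos_mask)   # positives covered by new clause
    return (n_old - n_new) * math.log2(p_new + 2)


if __name__ == "__main__":
    # Eight examples: bits 0-3 are positives, bits 4-7 are negatives.
    POS, NEG = 0x0F, 0xF0
    old = 0xFF            # covers everything
    new = 0b00010111      # covers 3 positives and 1 negative
    print(gain(old, new, POS, NEG))
```

The term lg(p_new + 2) is at least 1, so a clause that excludes negatives still receives some gain when it covers no positives, while clauses that keep more positives are rewarded more.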

The algorithm is given below. It is a standard implementation of the depth-first
algorithm, using the program stack by recursively calling probe_simple_clause.
Note that calc_gain_list_size calculates the number of branches in the search
tree at a certain depth in the tree. Also, gain estimates the gain when a simple
clause is combined with a clause to form a new clause, and best_bay estimates
the best possible value for the Q-heuristic given that this clause forms part of the
final hypothesis. This best estimate is achieved by assuming that the other clauses
that make up the final hypothesis, which is a set of clauses, are as good as this
clause in terms of positive cover, accuracy, and prior probability.

¹ Earlier implementations have looked at best-first [5,24] and depth-bounded
discrepancy search [25], but depth-first was found to be the most effective.

² This is a decimal number.

[Figure 6: search tree diagram, with the following annotations.]
Each node has a clause associated with it. All possible simple clause
additions to this clause are considered for the candidate clauses.
Level 0: each simple clause is combined with an empty body clause. The
100 most gainful clauses form the branches at level 0.
Level 1: only the 10 most gainful clauses are considered at level 1. These
clauses are formed by adding simple clauses, with an index greater than the
simple clause added at the parent node, to the clause in that parent node.
Level 2: only the 3 most gainful clauses make up the branches at level 2,
and only 1 at lower levels.

Fig. 6. The search tree where gain_list_length = 100

A list data structure for clauses, with associated functions, has been created for
this algorithm. The list is sorted by a value associated with each clause. Also,
the cover of each clause is maintained in the form of a bit vector. For the gain
list, the index of the last simple clause added to the clause is kept with each
clause in the list. When a new list is created it is given a maximum size. If a
list has reached its maximum size and a clause is added which has a value worse
than the worst clause in the list, then the clause is ignored and the list is not
altered. If a clause is added which has a value better than the worst clause in
the list, then the worst clause is removed and the added clause is inserted in its
appropriate place.
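
The behaviour of such a bounded, sorted list can be sketched as follows. This is a minimal Python illustration rather than Lime's actual data structure. The class name BoundedBestList and its methods are hypothetical, larger values are taken to be better, and a clause whose value merely ties with the worst entry of a full list is ignored (the text does not specify the tie case).

```python
import bisect


class BoundedBestList:
    """Keep at most max_size (value, item) pairs, best (largest value) first."""

    def __init__(self, max_size: int):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._keys = []    # negated values, ascending (so values descend)
        self._items = []   # items, kept parallel to _keys

    def is_full(self) -> bool:
        return len(self._keys) >= self.max_size

    def worst_value(self) -> float:
        return -self._keys[-1]

    def add(self, value: float, item) -> bool:
        """Insert in sorted position; return False if the item is ignored."""
        if self.is_full() and value <= self.worst_value():
            return False                      # not better than the worst entry
        pos = bisect.bisect_right(self._keys, -value)
        self._keys.insert(pos, -value)
        self._items.insert(pos, item)
        if len(self._keys) > self.max_size:   # drop the previous worst entry
            self._keys.pop()
            self._items.pop()
        return True

    def values(self):
        return [-k for k in self._keys]

    def items(self):
        return list(self._items)


if __name__ == "__main__":
    best = BoundedBestList(3)
    for value, name in [(5, "c1"), (1, "c2"), (3, "c3"), (0, "c4"), (4, "c5")]:
        best.add(value, name)
    print(best.values(), best.items())
```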

Algorithm 3 Clause Construction from Simple Clauses.

Input:
    S - Array of simple clauses and their coverage bit vectors
        /* Each element in S consists of the clause body */
        /* and a bit vector giving the coverage of the simple clause */
Output:
    C - A set of candidate clauses.
Parameters:
    num_candidate_clauses  /* The maximum number of candidate */
                           /* clauses to generate */
    gain_list_length       /* The length of the gain list used at level 0 */

begin
    foreach simple clause c ∈ S do
        compute gain(simple clause with empty body, c)
    od
    Sort S in descending order of gain computed above
    Let C be a list of size num_candidate_clauses
    Let E be a record representing a new clause
    Initialize E as follows
        E.body := {}
        E.bitvector := all bits set to 1
        E.last_index := -1
            /* last_index is the index of the last */
            /* simple clause in S used to form the new clause */
    probe_simple_clause(S, C, E, 0)
    output C
end

probe_simple_clause(S, C, R, L)
    Let GL be a new list of size calc_gain_list_size(L, gain_list_length)
    foreach i ∈ {R.last_index + 1, R.last_index + 2, ..., |S| - 1} do
        Let NC be a record representing a new clause
        Initialize NC as follows
            NC.body := S[i].body ⊎ R.body
            NC.bitvector := S[i].bitvector ∧ R.bitvector
            NC.last_index := i
        if (NC covers some elements) ∧ (NC.bitvector ≠ R.bitvector) then
            gain := gain(R, NC)
            if (extensions to NC could prove better than
                the worst clause in C) ∨ (C list not full) then
                Add (NC, gain) to the list GL
            fi
            value := best_bay(NC)
            Add (NC, value) to the list C
            /* Of course, the above addition only takes place if there is */
            /* space in C or the Bayesian estimate for NC */
            /* is better than the clause with the worst estimate in C */
        fi
    od
    foreach GE ∈ GL do
        probe_simple_clause(S, C, GE, L + 1)
    od

calc_gain_list_size(level, length)
    return ⌊length^(2^(-level))⌋

gain(old_clause, new_clause)
    n_old := negative_cover(old_clause)
    n_new := negative_cover(new_clause)
    p_new := positive_cover(new_clause)
    return (n_old - n_new) × lg(p_new + 2)

best_bay(clause)
    NN := number of negatives
    NP := number of positives
    TP := NP
    TN := NN - negative_cover(clause)
    FP := 0
    FN := negative_cover(clause)
    S := NP / positive_cover(clause)
    theta := estimate_theta_cover(clause)
    return TP × lg(… + noise) +
           (NN - FN × S) × lg(… + noise) +
           (FN × S) × lg(noise) +
           prob(clause) × S



6.1 Other Approaches 

Earlier versions of Lime employed two other search strategies. However, they
turned out to be less effective than depth-first search. The first of these was
best-first search. While this approach combines the advantages of both depth-first
and breadth-first search by extending the search tree at the node that appears to
be the most promising, it also retains the main drawback of breadth-first search:
excessive storage requirements. This may partly be overcome by maintaining
a bounded set of nodes for exploration in the search tree. However, such an
implementation tends either not to search deep enough, or to search only a
small portion of the possible branches at the top of the search tree, depending
on the heuristic used for determining the most promising nodes. In either case
the algorithm does not perform very well in many situations. For similar reasons
a beam search like the one employed by Clark and Niblett [8] in CN2 is not as
effective as depth-first search.

The other approach attempted was depth-bounded discrepancy search [25].
This simply re-orders a depth-first search, examining more probable clauses first.
Hence, the search space is restricted earlier in the search, as “good” clauses are
found earlier. However, because this technique requires either revisiting nodes
or storing the search tree, the efficiency gain did not outweigh the overhead of
revisiting nodes or of storage.
6.2 Efficiency

It is difficult to determine the exact size of the search tree as it is dynamically
pruned. However, there is an upper bound on the size of the search tree.

The number of nodes in the search tree depends on gain_list_size.
Let $l = \lceil \lg \lg \mathit{gain\_list\_size} \rceil - 1$ and let $b = \mathit{gain\_list\_size}$.
Then the maximum number of nodes in a tree of depth d will be

$$1 + b + b \times b^{2^{-1}} + b \times b^{2^{-1}} \times b^{2^{-2}} + \cdots + b \times b^{2^{-1}} \times \cdots \times b^{2^{-l}} + (d - l) \times b \times b^{2^{-1}} \times \cdots \times b^{2^{-l}}$$

The number of operations required at each node depends linearly on
the number of examples for the bit vector operation times the number of simple
clauses considered at the node. It should be noted that the bit vector operations
can be performed efficiently using the system level bitwise operations provided
by most architectures. The memory requirement is minimal as it is a depth-first
search algorithm.
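
The per-node bit vector operation can be sketched in a few lines. The Python fragment below is an illustration only. It assumes one bit per example and uses Python integers as bit vectors. The function extend is hypothetical and mirrors the test used when a simple clause is added to a clause: the new clause must cover some example and its cover must differ from that of its parent.

```python
from typing import Optional


def extend(parent_cover: int, simple_cover: int) -> Optional[int]:
    """Cover of the clause obtained by adding a simple clause to a parent.

    Conjunction of clause bodies corresponds to a bitwise AND of covers.
    Returns None when the new clause covers nothing or prunes nothing.
    """
    new_cover = parent_cover & simple_cover
    if new_cover == 0 or new_cover == parent_cover:
        return None
    return new_cover


if __name__ == "__main__":
    n_examples = 6
    empty_body = (1 << n_examples) - 1   # the empty body clause covers all examples
    s1, s2 = 0b110110, 0b011100          # covers of two simple clauses
    c1 = extend(empty_body, s1)
    c2 = extend(c1, s2)
    print(bin(c1), bin(c2))
```

A single AND processes a whole machine word of examples at once, which is why the cost per node grows only linearly in the number of examples.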

7 Inducing the Final Hypothesis 

The final stage in Lime’s inductive process is similar to the previous stage, 
though with some crucial differences. First, the search is for a logic program 
from a set of candidate clauses. Second, instead of using conjunction of the 
coverage vectors of simple clauses, we use disjunction of the coverage vectors of 
clauses, as an instance may be covered by any one of the clauses. Third, as there 
may not always be independence between clauses (due to recursion), a Prolog 
interpreter is used to accurately evaluate the hypothesis cover. Fourth, as the 
Prolog interpreter used has no backtracking, order is important in the list of 
clauses induced. The details are given in the following algorithm. 

Algorithm 4 Induction of Final Hypotheses from Candidate Clauses.

Input:
    C - array of clauses. /* Each element in the array consists of the clause */
                          /* and a bit vector giving the coverage of the clause. */
Output:
    H - A set of induced hypotheses.
Parameters:
    num_final_hypothesis  /* The maximum number of hypotheses induced */
    gain_list_length      /* The length of the gain list used at level 0 */

begin
    Let H be a list of size num_final_hypothesis
    Let L be a record representing a new logic program
    Initialize L as follows
        L.clauses := {}
        L.bitvector := all bits set to 0
    probe_clause(C, H, L, 0)
    output H
end

probe_clause(C, H, R, L)
    Let GL be a new list of size calc_gain_list_size(L, gain_list_length)
        /* calc_gain_list_size is defined in Algorithm 3 */
    foreach i ∈ {0 ... |C| - 1} do
        Let NL be a record representing a new logic program
        Initialize NL as follows
            NL.clauses := {C[i].clause} ∪ R.clauses
            NL.bitvector := C[i].bitvector ∨ R.bitvector
        if (NL covers some elements) ∧ (NL.bitvector ≠ R.bitvector) then
            gain := gain_lp(R, NL)
            if (extensions to NL could prove better
                than the worst clause in H) ∨ (H list not full) then
                Add (NL, gain) to the list GL
            fi
            value := bay(NL)
            if value better than worst value in H then
                Update NL.bitvector using Prolog interpreter and NL.clauses
                /* If NL.clauses is recursive then NL.bitvector (and value) */
                /* are only estimates; therefore, the Prolog interpreter is used */
                /* to compute the actual coverage and value. */
                value := bay(NL)
                Add (NL, value) to the list H
            fi
        fi
    od
    foreach GE ∈ GL do
        probe_clause(C, H, GE, L + 1)
    od

gain_lp(old_clause, new_clause)
    n_old := negative_cover(old_clause)
    p_old := positive_cover(old_clause)
    n_new := negative_cover(new_clause)
    p_new := positive_cover(new_clause)
    return lg(p_new + 2) × (2 × (n_old - n_new) + (p_new - p_old))

bay(logic_program)
    NN := number of negatives
    NP := number of positives
    TP := positive_cover(logic_program)
    TN := NN - negative_cover(logic_program)
    FP := NP - positive_cover(logic_program)
    FN := negative_cover(logic_program)
    theta := estimate_theta_cover(logic_program)
    return TP × lg(… + noise) +
           TN × lg(… + noise) +
           (FP + FN) × lg(noise) +
           prob(logic_program)
7.1 Recursive Logic Programs

In most cases each clause in a hypothesis may be considered independently. This
allows the clauses to be induced individually and then combined to form the
final hypothesis. This is the approach taken in ILP systems employing a greedy
covering strategy. Unfortunately, the independence of each clause
breaks down when recursion is involved because a recursive clause by itself will
not cover any examples.

The common approach in learning recursive clauses is to include all the
positive examples in the background knowledge. This allows the algorithm to
determine coverage by using these facts to unify with the recursive literals in
the body of the clause. However, this introduces many problems, which different
systems handle in a variety of ways. For example, FOIL induces a partial
ordering on the constants, and then requires at least one term in the recursive
literal to descend or ascend this ordering. This ensures that there are no loops
in the recursive call. CHILLIN, on the other hand, requires that at least one
term in the recursive literal be a proper sub-term of the corresponding term in
the head of this clause. This ensures that the clause induced does not lead to
infinite recursion.

Since Lime considers entire logic programs in the induction of the final
hypothesis, a Prolog interpreter is used to accurately determine the coverage of
potential hypotheses. This step, without including the positive examples in the
background knowledge, weeds out any poorly constructed recursive hypotheses.
For example, if a recursive logic program is missing its base case, it will quickly
be shown to cover no examples, and hence be given a poor posterior probability.
It should be noted, however, that when Lime needs to evaluate the coverage of
simple clauses or clauses in previous stages, it behaves like other ILP systems
and estimates the coverage of an individual recursive clause by including the
positive examples in the background knowledge. To address the problem
of infinite recursion, it does not directly restrict undesirable clauses; rather, it
constructs a graph of how the positive examples recursively use each other.
This approach enables a better estimate of a clause’s final coverage as part of a
complete logic program.

This process is best explained with an example. Consider the recursive clause
add(A, B, C) ← add(B, A, C). We wish to estimate its coverage when it forms
part of a complete hypothesis, of which we do not know the other clauses. Also
suppose the positive examples consist of three instances: add(1, 2, 3), add(1, 1, 2),
and add(2, 1, 3). Now if we naively include the positive examples in the
background knowledge and then resolve the clause with each instance, the clause
would cover all three instances. This clearly misrepresents the quality of the clause.
So as each instance is resolved, a graph is maintained of the instances that an
instance uses for its resolution. If a loop in the graph is detected by resolving an
instance, then the recursive clause has an infinite recursive loop. Hence, one of the
examples should be a base case to solve this dilemma, so the current instance is
considered not covered by the recursive clause and the dependence graph is kept
loop free. This will give a better estimate of the coverage of the recursive clauses.

Figure 7 illustrates this process for our example. At stage 1 the first instance
add(1, 2, 3) is tested by resolving it with the recursive clause, which uses the third
instance add(2, 1, 3). Since no loops are created in the graph, the first example
is considered covered by the clause and the edge is added to the graph. At stage
2 the second instance add(1, 1, 2), when resolved, requires itself and would form
a circuit in the graph, so the clause is considered not to cover the instance, and
the graph is left unchanged. At stage 3 the third instance add(2, 1, 3) would,
when resolved, require the first instance add(1, 2, 3). This would form a circuit
in the graph, so the recursive clause could not cover both instances. One instance
needs to be covered by another clause; hence, we estimate the recursive clause
to still cover the first instance but not the third. Finally, the recursive clause is
estimated to cover only the first of the three instances. A naive approach would have
it covering all three instances, which is a poor estimate for the recursive clause.
Figure 7 shows the stages Lime undertakes in this process.



[Figure 7: four panels (Stage 1, Stage 2, Stage 3, Final State), each showing the
instances add(1,2,3), add(1,1,2), and add(2,1,3) and the dependency edges between
them. The instance under test is add(1,2,3) in Stage 1, add(1,1,2) in Stage 2, and
add(2,1,3) in Stage 3.]

Fig. 7. Example of estimating coverage of an individual recursive clause



Also this approach directs the search toward the base cases by attempting 
to cover examples the recursive clause will finally cover in the logic program. 
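
The loop-avoiding coverage estimate for the example clause add(A, B, C) ← add(B, A, C) can be sketched as follows. This Python fragment is an illustration under simplifying assumptions, not Lime's implementation: it is hard-wired to this one clause (the body literal needed for an instance is obtained by swapping its first two arguments), and the names estimate_cover and reaches are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Instance = Tuple[int, int, int]


def reaches(graph: Dict[Instance, Set[Instance]],
            start: Instance, target: Instance) -> bool:
    """True if target can be reached from start along dependency edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def estimate_cover(positives: List[Instance]) -> List[Instance]:
    """Instances estimated as covered by add(A,B,C) <- add(B,A,C)."""
    known = set(positives)               # positives act as background facts
    graph: Dict[Instance, Set[Instance]] = {}
    covered = []
    for inst in positives:
        a, b, c = inst
        needed = (b, a, c)               # instance used by the body literal
        if needed not in known:
            continue                     # body cannot be resolved
        if reaches(graph, needed, inst):
            continue                     # edge inst -> needed would close a loop
        graph.setdefault(inst, set()).add(needed)
        covered.append(inst)
    return covered


if __name__ == "__main__":
    print(estimate_cover([(1, 2, 3), (1, 1, 2), (2, 1, 3)]))
```

For the three instances of the example, only add(1, 2, 3) is estimated as covered: add(1, 1, 2) depends on itself, and add(2, 1, 3) would close the loop with add(1, 2, 3).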

Another problem encountered when learning recursive logic programs
is a sparse set of positive examples. That is, if many of the examples are separated
by more than one resolution step, then the strategy of including the positive
examples in the background knowledge will not help in determining coverage.
Lime partly handles this problem by the use of a Prolog interpreter in the final
stage of induction. However, a recursive clause must show some potential when
evaluated by itself in order to form part of the candidate clauses, and this will not
be the case when the example set is sparse.
8 Prior Probability

To generate priors over the hypothesis space a probabilistic context-free grammar
is used. A probabilistic context-free grammar (G, V) is a context-free grammar
G where each production rule is assigned a probability $p_i$. Note that $0 \le p_i \le 1$
and the sum of the probabilities with the same nonterminal left-hand side is 1.
The probability of a derivation is given by the product of the probabilities of
the productions used in the derivation. The probability of a sentence generated
by the grammar is the sum of the probabilities of all distinct derivations from
the start non-terminal to the sentence.

In many respects the way priors are attached is arbitrary. As the number 
of examples increases the prior becomes irrelevant. Basically, the probabilistic 
context-free grammar forms a mapping between sentences of the grammar, which 
are logic programs, and a probability value for the sentence, which is then defined 
to be the prior for this logic program. 

Another way of attaching priors would be to encode the hypothesis into a 
bit string, then calculate the prior from its length. In this approach care must 
be taken when encoding the logic program, as it requires a prefix code. This 
still forms a mapping between the logic program and its probability via a bit 
string. By using probabilistic context-free grammars the intermediate stage is 
removed, simplifying the task and allowing more flexibility in the formation of 
the mapping. 

The probabilistic context-free grammar used by Lime for calculating
priors of logic programs is given in Table 1. The non-terminals LP, C, B, L,
and T correspond respectively to a logic program, a clause, the body of a clause,
a literal, and a term. μ_C is a parameter which sets the expected number of clauses
in a logic program. Similarly, μ_L is the expected number of literals in the body of
clauses, and μ_V is the expected variable number for any term in a logic program.
Details such as “commas” and “periods” are ignored as they only have cosmetic
effects.

Rule                                                    Probability
LP → ε                                                  p_1 = 1/(μ_C + 1)
LP → C LP                                               p_2 = 1 - 1/(μ_C + 1)
C  → head B                                             p_3 = 1
B  → ε                                                  p_4 = 1/(μ_L + 1)
B  → L, B                                               p_5 = 1 - 1/(μ_L + 1)
L  → name_1(T, T, ..., T)      (arity_1 terms)          p_6 = 1/n_L
L  → name_2(T, T, ..., T)      (arity_2 terms)          p_7 = 1/n_L
...
L  → name_{n_L}(T, T, ..., T)  (arity_{n_L} terms)      p_{5+n_L} = 1/n_L
T  → V                                                  p_{5+n_L+1} = 1/(μ_V + 1)
T  → T'                                                 p_{5+n_L+2} = 1 - 1/(μ_V + 1)

Table 1. Grammar used by Lime for calculating the prior of a hypothesis.

From this grammar the stochastic expectation matrix M may be calculated,
given in Table 2, where the element at the row corresponding to non-terminal X
and the column corresponding to non-terminal Y is the expected number of times
X will be replaced by Y in exactly one production rule. As the spectral radius
ρ(M), which is the modulus of the largest eigenvalue, is always less than 1, the
probabilistic grammar is consistent. That is, the sum of the probabilities of all
the sentences generated from this grammar is 1.
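
The consistency argument can be made concrete with a small numerical check. The Python sketch below assumes a particular reading of Table 2: rows and columns are ordered LP, C, B, L, T, and the only non-zero diagonal entries are 1 - 1/(μ_C + 1), 1 - 1/(μ_L + 1) and 1 - 1/(μ_V + 1). Under that reading M is upper triangular, so its eigenvalues are its diagonal entries. The function names and the parameter values are hypothetical.

```python
from typing import List


def expectation_matrix(mu_c: float, mu_l: float, mu_v: float,
                       arities: List[int]) -> List[List[float]]:
    """Stochastic expectation matrix, rows and columns ordered LP, C, B, L, T."""
    q_c = 1 - 1 / (mu_c + 1)      # probability of LP -> C LP
    q_l = 1 - 1 / (mu_l + 1)      # probability of B -> L, B
    q_v = 1 - 1 / (mu_v + 1)      # probability of T -> T'
    mean_arity = sum(arities) / len(arities)
    return [
        [q_c, q_c, 0.0, 0.0, 0.0],         # LP
        [0.0, 0.0, 1.0, 0.0, 0.0],         # C
        [0.0, 0.0, q_l, q_l, 0.0],         # B
        [0.0, 0.0, 0.0, 0.0, mean_arity],  # L
        [0.0, 0.0, 0.0, 0.0, q_v],         # T
    ]


def spectral_radius_upper_triangular(m: List[List[float]]) -> float:
    """For an upper triangular matrix the eigenvalues are the diagonal entries."""
    n = len(m)
    if any(m[i][j] != 0 for i in range(n) for j in range(i)):
        raise ValueError("matrix is not upper triangular")
    return max(abs(m[i][i]) for i in range(n))


if __name__ == "__main__":
    m = expectation_matrix(mu_c=3, mu_l=2, mu_v=1, arities=[2, 3])
    print(spectral_radius_upper_triangular(m))
```

Each diagonal entry has the form 1 - 1/(μ + 1), which is strictly below 1 for any finite μ, so the spectral radius stays below 1 whatever the arities are.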



M =

        LP                 C                  B                  L                  T
LP      1 - 1/(μ_C + 1)    1 - 1/(μ_C + 1)    0                  0                  0
C       0                  0                  1                  0                  0
B       0                  0                  1 - 1/(μ_L + 1)    1 - 1/(μ_L + 1)    0
L       0                  0                  0                  0                  (1/n_L) Σ_{i=1}^{n_L} arity_i
T       0                  0                  0                  0                  1 - 1/(μ_V + 1)

Table 2. Stochastic expectation matrix M for the logic program probabilistic
context-free grammar.



The calculation of P(h) in Lime is trivial, as there is a unique derivation from
the start non-terminal LP to any logic program h. This derivation is simply
obtained by parsing h, and P(h) is the product of the probabilities assigned to
the production rules used. As Lime requires lg(P(h)) rather than P(h), lg(P(h)) is
calculated directly. This moves the lg into the calculation, changing multiplication
to addition. Also, the numbers used are manageable: they do not become
exponentially small.
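
A sketch of this calculation in Python is given below. It relies on several assumptions that are not taken from the text: a logic program is represented as a list of clause bodies, each body is a list of (predicate name, variable numbers) pairs, the head is treated as a terminal that contributes no probability, and a term holding variable number k is derived by k applications of T → T' followed by T → V. The function lg_prior and its parameter names are hypothetical.

```python
import math
from typing import List, Tuple

Literal = Tuple[str, List[int]]   # predicate name, variable number of each term
Clause = List[Literal]            # body literals only
Program = List[Clause]


def lg_prior(program: Program, mu_c: float, mu_l: float,
             mu_v: float, n_l: int) -> float:
    """lg P(h): sum of the lg probabilities of the productions used to derive h."""
    lg = math.log2
    stop_c, more_c = lg(1 / (mu_c + 1)), lg(1 - 1 / (mu_c + 1))
    stop_l, more_l = lg(1 / (mu_l + 1)), lg(1 - 1 / (mu_l + 1))
    stop_v, more_v = lg(1 / (mu_v + 1)), lg(1 - 1 / (mu_v + 1))
    pick_name = lg(1 / n_l)

    total = stop_c                            # LP -> ε ends the program
    for body in program:
        total += more_c                       # LP -> C LP (C -> head B has p = 1)
        total += stop_l                       # B -> ε ends the body
        for _name, terms in body:
            total += more_l + pick_name       # B -> L, B and L -> name_i(...)
            for var in terms:
                total += var * more_v + stop_v
    return total


if __name__ == "__main__":
    h = [[("inc", [0, 1])]]
    print(lg_prior(h, mu_c=1, mu_l=1, mu_v=1, n_l=2))
    print(lg_prior(h + [[]], mu_c=1, mu_l=1, mu_v=1, n_l=2))
```

Because every production contributes a non-positive term, adding a clause can only lower the result, which matches the monotonicity of the prior.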

Note that if any clause is added to a hypothesis then the prior must decrease;
hence, for all h and c, P(h) > P(h ∪ {c}). This gives a simple way of calculating
a bound on the maximum prior for any partly completed hypothesis. Although
obvious, this bound is useful in restricting the search space.



9 θ Estimation

An estimate of θ(ext(h)) is required for computing the Q heuristic for a
hypothesis h. In general, Lime considers many candidate hypotheses, so an efficient
method of estimating the θ value for a hypothesis is essential to make Lime
viable. Recall from Section 3 that the θ value is a measure of the proportion of
the instance space a hypothesis covers.

In principle it may be calculated exactly by taking the sum over the probability
of each instance that is in the extension of the hypothesis, as shown in Equation (8).
However, this is impossible in practice for three reasons: the extension of the
hypothesis is usually infinite; the instance space distribution is unknown; and as the
instance space distribution is a mapping to the reals, the result given by a θ evaluation
is not, in general, representable by a Turing machine, let alone computable by
one. Hence, an approximate estimation of this value must be found.

$$\theta = \sum_{e \in \mathrm{ext}(h)} D_X(e) \qquad (8)$$

Lime estimates θ by taking a random sample of n instances, then calculating
the number c of these instances the hypothesis covers. Next, a Laplacian
estimate is used. Note that the random sample of instances is generated at
the start of Lime’s execution, and the same sample is used for all θ estimations
in simple clause construction, clause construction, and the induction of the
final hypotheses. This is for reasons of efficiency, and it keeps the estimation
consistent across different hypotheses. A more general hypothesis, therefore, will
always have the same or a higher θ estimate.

To generate a single random instance, each term in the instance is randomly
selected by a uniform distribution over the constants that constitute the term’s
type. This process is repeated until the required number of samples is generated;
by default Lime uses 500 instances to make up the random sample. This implicitly
assumes a uniform distribution over a finite restriction of the instance space, as
prescribed by the ground terms given in the examples and background knowledge
given to Lime. Although this uniform distribution is in general different from the
unknown distribution that generated the examples, it will still produce useful
estimates of θ, as the θ estimate is used mainly to compare different hypotheses.
The comparison by the θ estimate does not change greatly under transformations
in the instance space distribution. That is, if ext(h_1) ⊆ ext(h_2) then
θ(h_1) ≤ θ(h_2) for any instance space distribution.
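
The sampling scheme can be sketched as follows. The Python fragment is an illustration under stated assumptions: a hypothesis is modelled as a Boolean function on instances, each argument type is given as a list of constants, and the Laplacian estimate is taken to have the usual form (c + 1)/(n + 2), which is an assumption here. The names make_sample and estimate_theta are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Instance = Tuple[int, ...]


def make_sample(type_constants: Sequence[Sequence[int]], n: int,
                rng: random.Random) -> List[Instance]:
    """Draw n instances, choosing each term uniformly from its type's constants."""
    return [tuple(rng.choice(consts) for consts in type_constants)
            for _ in range(n)]


def estimate_theta(covers: Callable[[Instance], bool],
                   sample: List[Instance]) -> float:
    """Laplace-style estimate (c + 1) / (n + 2); the exact form is an assumption."""
    c = sum(1 for inst in sample if covers(inst))
    return (c + 1) / (len(sample) + 2)


if __name__ == "__main__":
    rng = random.Random(0)
    # One fixed sample, reused for every hypothesis, as described above.
    sample = make_sample([range(10), range(10)], 500, rng)
    specific = lambda inst: inst[1] == inst[0] + 2   # e.g. a plus2-like relation
    general = lambda inst: inst[1] > inst[0]         # a more general relation
    print(estimate_theta(specific, sample), estimate_theta(general, sample))
```

Because the same sample is reused, a hypothesis whose extension contains that of another can never receive a smaller estimate.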



10 Empirical Results 

In this section we present experimental results to illustrate how Lime achieves
its design goals of better noise handling, learning from a fixed set of examples, and
learning recursive logic programs.



10.1 Recursive Logic Programs 

Bratko’s Logic Programs Examples A set of logic programs based on Ivan
Bratko’s book PROLOG Programming for Artificial Intelligence has been
generated to assess ILP systems. These were obtained from the UCI Machine
Learning Repository and given to Lime without changing the examples or
background knowledge (some cosmetic changes are required to the file format to make
it understandable to Lime). The examples consist of all the positive and
negative examples restricted to lists of maximum length 3; also, the constant symbols
are restricted to the numbers 1, 2, and 3. Table 3 shows the logic programs in
question. The table also shows the background knowledge that is provided to
the learner. As there is no noise in the examples, the noise parameter is set to 0
for Lime. Table 4 gives the results of both Lime and FOIL on these data sets.
Note that Lime successfully induces the target hypothesis for 11 out of the 16
logic programs, whereas FOIL was successful on 8 out of the 16 logic programs.



Quick Sort The relation quick sort is used as a benchmark to test the ability
of an ILP system. Quick sort is a difficult recursive relation to learn as the
key recursive clause is complex. The complexity is due to the presence of two
recursive literals in the body, and the size of this clause. Another difficulty,
especially with regard to Lime, is that the recursive clause is one big simple
clause which has a depth of 3; hence, Lime must explore the space of simple
clauses deeply enough to discover this clause. However, Lime successfully induced
the following logic program for quick sort:

qsort(A, B) ← partition1(A, C), partition2(A, D),
              concat(C, D, B), partition2(B, D).
qsort(A, B) ← partition1(A, C), partition2(A, D),
              qsort(C, E), qsort(D, F), concat(E, F, B).

Lime took 393.57 seconds for inducing the above program. 

10.2 Noise 

We present results from three sets of representative experiments that compare 
Lime with FOIL and PROGOL. The first experiment considers learning the 
recursive predicate add with different levels of noise. The second experiment is 
performed on the complex krk domain, also with different noise levels. Third, 
we randomly generate a domain and consider how the number of clauses in the 
target concept affects predictive accuracy. 



Plus Two We first demonstrate Lime’s superior noise handling ability for the
simple concept plus2, which may be represented by the following logic program:
plus2(A, B) ← inc(A, C), inc(C, B).

In the above, inc denotes the increment predicate available as background
knowledge. A random selection of 50 positive and 50 negative examples is given
to Lime. These examples include noise. The predictive error of the induced
hypothesis is measured against a noise-free test set that is generated by taking the
Name: Concatenation
Logic program:
    conc(A, B, C) ← empty(A), equal(B, C).
    conc(A, B, C) ← components(A, D, E), components(C, D, F), conc(E, B, F).
Background knowledge: components, member, empty, equal

Name: Delete
Logic program:
    del(A, B, C) ← components(B, A, C).
    del(A, B, C) ← components(B, D, E), del(A, E, F), components(C, D, F).
Background knowledge: components, member, conc, last

Name: Dividelist
Logic program:
    dividelist(A, B, C) ← empty(A), empty(B), empty(C).
    dividelist(A, B, C) ← odd(A), components(A, D, E),
                          dividelist(E, F, C), components(B, D, F).
    dividelist(A, B, C) ← even(A), components(A, D, E),
                          dividelist(E, B, F), components(C, D, F).
Background knowledge: components, member, conc, last, del, insert, sublist,
    permutation, even, odd, reverse, palindrome, shift, subset

Name: Evenlength
Logic program:
    even(A) ← empty(A).
    even(A) ← components(A, B, C), components(C, D, E), even(E).
Background knowledge: components, member, conc, last, del, insert, sublist,
    permutation

Name: Insert
Logic program:
    insert(A, B, C) ← del(A, C, B).
Background knowledge: components, member, conc, last, del

Name: Last
Logic program:
    last(A, B) ← components(B, A, C), empty(C).
    last(A, B) ← components(B, C, D), last(A, D).
Background knowledge: components, member

Name: Member
Logic program:
    member(A, B) ← components(B, A, C).
    member(A, B) ← components(B, C, D), member(A, D).
Background knowledge: components, conc

Name: Oddlength
Logic program:
    odd(A) ← components(A, B, C), empty(C).
    odd(A) ← components(A, B, C), components(C, D, E), odd(E).
Background knowledge: components, member, conc, last, del, insert, sublist,
    permutation

Name: Palindrome1
Logic program:
    palindrome(A) ← reverse(A, B), equal(A, B).
Background knowledge: components, member, conc, last, del, insert, sublist,
    permutation, even, odd

Name: Palindrome2
Logic program:
    palindrome(A) ← empty(A).
    palindrome(A) ← components(A, B, C), empty(C).
    palindrome(A) ← components(A, B, C), last(B, A), front(C, D), palindrome(D).
Background knowledge: components, member, conc, last, del, insert, sublist,
    permutation, even, odd, reverse

Name: Permutation
Logic program:
    permutation(A, B) ← empty(A), empty(B).
    permutation(A, B) ← components(A, C, D), permutation(D, E), insert(C, E, B).
Background knowledge: components, member, conc, last, del, insert, sublist,
    permutation

Name: Reverse
Logic program:
    reverse(A, B) ← empty(A), empty(B).
    reverse(A, B) ← components(A, C, D), empty(D), equal(B, A).
    reverse(A, B) ← components(A, C, D), components(B, E, F),
                    last(C, B), last(E, A), front(D, G),
                    front(F, H), reverse(G, H).
Background knowledge: components, member, conc, last, del, insert, sublist,
    permutation, even, odd, reverse

Name: Shift
Logic program:
    shift(A, B) ← components(A, C, D), front(B, D), last(C, B).
Background knowledge: components, member, conc, last, del, insert, sublist,
    permutation, even, odd, reverse, palindrome

Name: Sublist
Logic program:
    sublist(A, B) ← conc(A, C, B).
    sublist(A, B) ← components(B, C, D), sublist(A, D).
Background knowledge: components, member, conc, last, del, insert, shift

Name: Subset
Logic program:
    subset(A, B) ← empty(A).
    subset(A, B) ← components(A, C, D), member(C, B), subset(D, B).
Background knowledge: components, member, conc, last, del, insert, sublist,
    permutation, even, odd, reverse, palindrome, shift

Name: Translate
Logic program:
    translate(A, B) ← empty(A), empty(B), empty(C).
    translate(A, B) ← components(A, C, D), means(C, E),
                      components(B, E, F), translate(D, F).
Background knowledge: components, member, conc, last, del, insert, sublist,
    permutation, even, odd, reverse, palindrome, shift, means

Table 3. Bratko’s recursive logic programs
Name             Lime            FOIL
Concatenation    Unsuccessful    Successful
Delete           Successful      Successful
Dividelist       Unsuccessful    Unsuccessful
Evenlength       Successful      Unsuccessful
Insert           Successful      Successful
Last             Successful      Successful
Member           Successful      Successful
Oddlength        Successful      Unsuccessful
Palindrome1      Successful      Successful
Palindrome2      Successful      Unsuccessful
Permutation      Unsuccessful    Unsuccessful
Reverse          Unsuccessful    Unsuccessful
Shift            Successful      Unsuccessful
Sublist          Successful      Successful
Subset           Successful      Unsuccessful
Translate        Unsuccessful    Successful

Table 4. Bratko’s recursive logic programs: empirical results of Lime and FOIL.



“first” 20 positive examples and a random selection of 20 negative examples.
This process is repeated 100 times to calculate the average predictive error. It
is then repeated at different noise levels, and the results are shown in Figure 8.
The error bars in the figure indicate the sample standard deviation. The results
show that Lime is able to correctly learn the concept with noise levels of up to
approximately 70%. The same test is carried out with FOIL and PROGOL (see
footnote 12). Lime performs better than FOIL and PROGOL for noise levels of up
to approximately 70%. Here, FOIL over-generalizes, inducing a less predictive
hypothesis. This is mainly due to the covering approach, which introduces
unnecessary clauses. However, for noise levels greater than 70%, all three
systems perform poorly.



Addition Lime’s noise handling ability is demonstrated in the context of add
(the addition relation), a target predicate that requires a recursive definition.
The target concept may be represented by the hypothesis:

add(A, B, C) ← equal(A, C), zero(B).

add(A, B, C) ← inc(D, B), add(A, D, E), inc(E, C).

We take a random selection of 200 positive and 200 negative examples but
perform only 20 repetitions at each noise level. Figure 9 shows the relationship
between noise and predictive error measured against a noise-free test set of the
“first” 25 positive examples and a random set of 25 negative examples. The
results show that the gap between Lime and the other systems widens further when
the target concept requires a recursive definition. Experiments with FOIL and
PROGOL were limited to 40% and 15% noise levels respectively because the
quality of the programs output by these systems beyond these noise levels was
difficult to assess.
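To make the recursive target concept concrete, the two clauses for add can be read as a membership test over the natural numbers. The following is a minimal Python sketch, assuming that inc(X, Y) means Y = X + 1, zero(X) means X = 0, and equal is identity; the function name `add_holds` is ours and is not part of Lime.

```python
def add_holds(a: int, b: int, c: int) -> bool:
    """Decide add(A, B, C) over the naturals by following the two clauses."""
    if min(a, b, c) < 0:
        return False  # only natural numbers are in the domain
    # Clause 1: add(A, B, C) <- equal(A, C), zero(B).
    if b == 0:
        return a == c
    # Clause 2: add(A, B, C) <- inc(D, B), add(A, D, E), inc(E, C).
    d = b - 1  # inc(D, B): B = D + 1
    e = c - 1  # inc(E, C): C = E + 1
    return add_holds(a, d, e)
```

Each recursive step lowers both B and C by one, so the base clause is reached after B steps; this is why small numbers in the examples help a learner find the base case.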



12 All our experiments are with FOIL, version 6.3, and with PROGOL, version 4.1.

Fig. 8. Predictive Error vs Noise for plus2



KRK Domain The KRK domain has been well studied, especially with respect
to noise. This concept is exactly representable in first-order logic, though
the representation requires several clauses. The relation illegal checks if an
end game position is an illegal chess position.

In each trial the training example set is constructed from 300 examples, of
which a proportion are noisy. Examples that are not noisy are chosen by
randomly selecting the rank and file of each piece, where the distribution is uniform
over both rank and file, then determining if the state is illegal and labeling it
appropriately. A noisy example is constructed by again randomly positioning
each piece, then randomly labeling it as either illegal or not-illegal. Note that a
noisy example may still be correctly classified. The accuracy of a hypothesis
produced by the learning system is estimated by creating 10000 random examples and
calculating the proportion of these that the hypothesis correctly classifies. Each
trial is repeated 20 times and the mean and sample standard deviation of the
accuracy are calculated.
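The construction of a noisy training set described above can be sketched as follows. This is an illustration only: the legality test `is_illegal` is a simplified assumption of ours (white king, white rook, black king, white to move), not the definition used in the experiments, and fixing the number of noisy examples to `round(size * noise)` is also an assumption.

```python
import random


def is_illegal(wk, wr, bk):
    """Simplified KRK test (assumption). Squares are (file, rank) pairs in 0..7."""
    if wk == wr or wk == bk or wr == bk:
        return True  # two pieces on the same square
    if max(abs(wk[0] - bk[0]), abs(wk[1] - bk[1])) <= 1:
        return True  # kings on adjacent squares
    if wr[0] == bk[0]:  # rook and black king share a file
        blocked = wk[0] == wr[0] and min(wr[1], bk[1]) < wk[1] < max(wr[1], bk[1])
        return not blocked
    if wr[1] == bk[1]:  # rook and black king share a rank
        blocked = wk[1] == wr[1] and min(wr[0], bk[0]) < wk[0] < max(wr[0], bk[0])
        return not blocked
    return False


def random_position(rng):
    """Uniformly random file and rank for each of the three pieces."""
    return tuple((rng.randrange(8), rng.randrange(8)) for _ in range(3))


def make_training_set(size, noise, rng):
    """Return (position, label) pairs; a fraction `noise` gets a random label."""
    n_noisy = round(size * noise)
    examples = []
    for index in range(size):
        position = random_position(rng)
        if index < n_noisy:
            label = rng.random() < 0.5  # noisy: may be correct by chance
        else:
            label = is_illegal(*position)
        examples.append((position, label))
    rng.shuffle(examples)
    return examples
```

For instance, `make_training_set(300, 0.2, random.Random(0))` yields 300 labeled positions of which 60 carry a random label.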

Quinlan’s decision tree learner C4.5 is also considered in this domain. The
attribute giving the rank and file distance between pieces is included to help C4.5
represent the concept. The results of these experiments are shown in Figure 10.
The error bars show the sample standard deviation.

The diagram shows that PROGOL induces a more accurate hypothesis for
low levels of noise; however, Lime performs better at higher levels of noise. The
predictive error shown by FOIL appears to be linearly dependent on the noise
in the training set. The poorer result shown by C4.5 is partly due to its
inadequacy in representing the concept.






E. McCreath and A. Sharma 



[Figure omitted; horizontal axis: Noise]

Fig. 9. Predictive error vs Noise for add



Randomly Generated Domain A randomly generated domain is created to
examine how the ILP systems perform as the concept becomes more complex.
A simple approach to introducing complexity into a concept is to include more
clauses, so our measure of complexity here is the number of clauses in the target
hypothesis.

Each target predicate consists of two terms, which when grounded are
integers from 0 to 29. This yields an instance space of size 900. The background
knowledge consists of 10 randomly generated unary predicates. Each clause
consists of exactly 2 literals, which are randomly selected from the background
knowledge. Training and test sets are constructed by first randomly generating a
target hypothesis with the set number of clauses; this hypothesis is then used to
classify the 900 instances. These are divided into training and test sets
with a 90%/10% split, respectively. Before the training set is given to the learner
it is corrupted by 10% noise. This process is repeated 10 times and the average
error and sample standard deviation are calculated. Figure 11 shows these results
for 1 to 10 clauses.

Interestingly, FOIL shows the same predictive error independent of the number
of clauses considered, whereas both PROGOL and Lime become less accurate
as the number of clauses, and hence the complexity, increases.

10.3 Learning from Positive Examples and from Negative Examples 

Plus Two This set of experiments gives empirical evidence that positive
examples are more useful than negative examples for a target concept that is “small”
with respect to the instance space distribution. The experiments also give
evidence for the converse: negative examples are more useful than positive
examples for a target concept that is “large” with respect to the instance space
distribution. These experiments also establish that Lime is capable of learning
from only positive data and from only negative data.

Fig. 10. The predictive error as noise is introduced into the training examples in the
KRK domain.

Fig. 11. The predictive error vs number of clauses in the randomly generated domain.

Fig. 12. Error vs i (i = number of positive examples & 8-i = number of negative
examples) for the plus2 and notplus2 logic programs

We consider two concepts, plus2 and notplus2 (the complement of
plus2; that is, notplus2(A, B) holds if $B \neq A + 2$). It is easy to see that
under reasonable assumptions, plus2 is a “small” concept and notplus2 is a
“large” concept. Assuming the instance space $X = \{1..n\} \times \{1..n\}$, then under a
uniform distribution the plus2 concept covers $(n-2)/n^2$ of the instance space. In
the experiment n = 50 and hence the plus2 concept covers 0.0192 of the instance
space. The background knowledge is identical for both plus2 and notplus2,
consisting of the increment relation and its complement, and a constant relation for
each number in the range. Lime is run on examples of plus2 and notplus2. The
total number of examples is invariant over each test; however, the number of
positive examples is increased as the number of negative examples is decreased.
Each test is repeated 100 times and the results for both plus2 and notplus2
are shown in Figure 12. In all experiments the test set consists of 100 randomly
selected positive examples and 100 randomly selected negative examples. The
error bars show the sample standard deviation.
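The coverage figure quoted for plus2 can be checked by direct enumeration. The sketch below assumes that the instance space is the set of all pairs (A, B) with both components in 1..n, which is the reading consistent with the value 0.0192 for n = 50; the function name `plus2_coverage` is ours.

```python
from fractions import Fraction


def plus2_coverage(n: int) -> Fraction:
    """Fraction of pairs (a, b) in {1..n} x {1..n} that satisfy b = a + 2."""
    pairs = [(a, b) for a in range(1, n + 1) for b in range(1, n + 1)]
    positives = [(a, b) for (a, b) in pairs if b == a + 2]
    return Fraction(len(positives), len(pairs))
```

For n = 50 there are 48 positive pairs among 2500, so `float(plus2_coverage(50))` gives 0.0192, and notplus2 covers the remaining 0.9808 of the space.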

These experiments are also repeated at different levels of noise used in
generating the training examples. The graphs for plus2 and notplus2 are shown in
Figure 13.

Fig. 13. Error vs i (i = number of positive examples & 8-i = number of negative
examples) for the plus2 relation shown on the left and the notplus2 relation shown
on the right. These trials are conducted with different levels of noise used in generating
the training examples.



Addition In this set of experiments we examine the number of examples required
for Lime to induce the addition relation from only positive examples. Note that,
as there are only positive examples, the empty-bodied clause is complete and
consistent with respect to these examples. However, with enough positive examples
the Q heuristic favors the addition relation over the empty-bodied clause: the
θ estimate for addition is smaller, and this outweighs the effect of the larger prior
probability of the empty-bodied clause.

The positive examples are randomly generated using a distribution over the
instance space. Rather than restricting the instance space to a finite domain
and using a uniform distribution over this space, a distribution over all
three-term predicates is used. The advantage of this approach is that small numbers
are given higher probabilities and hence occur more frequently in the example set.
This aids the induction of the base case in the recursive addition relation. The
background knowledge consists of the inc, zero, and equal relations; these are
necessary and sufficient to learn addition. Initially 10 positive examples are
randomly generated and given to Lime. Once Lime has run, the induced hypothesis
is tested on the “first” 25 positive examples and a random selection of 25
negative examples. We also note whether the hypothesis exactly expresses the
addition relation. This is repeated 10 times and the mean and sample standard
deviation are tabulated; the number of times the correct relation is induced is
also counted. This is repeated for different numbers of positive examples in
increments of 10. The graphs in Figure 14 show the average error decreasing as the
number of positive examples increases, and the number of trials in which Lime
induces the exact addition relation.

Learning the addition relation from only negative examples is not investigated:
positive examples are used in estimating the preliminary cover of recursive
clauses, and since these are absent in negative-only data, Lime will not learn
recursive logic programs from only negative data.

[Figure omitted; two panels, horizontal axis of each: Number of Positive Examples]

Fig. 14. Lime inducing addition from only positive examples, with error vs number of
training examples on the left and number of correctly induced hypotheses, out of 10
trials, vs number of training examples on the right.



11 Conclusion 

The design of the ILP system Lime was described. The notion of simple clause 
was introduced and its use in the design of Lime was discussed. It was shown 
that combining simple clauses to form candidate clauses provides an effective 
alternative to growing clauses one literal at a time. Detailed algorithms for simple 
clause construction, clause construction, and logic program construction were 
given. Empirical results were presented that reinforce the superior noise handling 
ability of Lime. The performance of Lime is particularly good when it is learning 
recursive definitions in the presence of noise. 

Work in progress involves application of Lime to real-world domains and
experiments with a boosted version of Lime.


Abstract. A neural tree is a feedforward neural network with at most
one edge outgoing from each node. We investigate the number of examples
that a learning algorithm needs when using neural trees as hypothesis
class. We give bounds for this sample complexity in terms of
the VC dimension. We consider trees consisting of threshold, sigmoidal
and linear gates. In particular, we show that the class of threshold trees
and the class of sigmoidal trees on n inputs both have VC dimension
$\Omega(n \log n)$. This bound is asymptotically tight for the class of threshold
trees. We also present an upper bound for this class where the constants
involved are considerably smaller than in a previous calculation. Finally,
we argue that the VC dimension of threshold or sigmoidal trees cannot
become larger by allowing the nodes to compute linear functions. This
sheds some light on a recent result that exhibited neural networks with
quadratic VC dimension.



1 Introduction 

The sample complexity, that is, the number of examples required for a learning
algorithm to create hypotheses that generalize well, is a central issue in the theory
of machine learning. In this paper we study the sample complexity for hypothesis
classes consisting of neural trees. These are feedforward neural networks where
each node, except the output node, has exactly one outgoing connection. Since
these networks have fewer degrees of freedom, they are expected to be learnable
more efficiently than unrestricted neural networks.

The computational complexity of learning using trees has been extensively
studied in the literature. Angluin et al., for instance, investigated the existence
of efficient algorithms that use queries to learn Boolean trees, also known as
read-once formulas. Research on trees employing neural gates was initiated
by Golea et al. They designed algorithms for learning so-called $\mu$-Perceptron
networks with binary weights. (A $\mu$-Perceptron network is a disjunction of
threshold gates where each input node is connected to exactly one threshold gate.)
They also considered tree structures in the form of $\mu$-Perceptron decision lists
and nonoverlapping Perceptron networks.

* Part of this work was done while the author was with the Institute for Theoretical
Computer Science at the Technische Universität Graz, A-8010 Graz, Austria.

M.M. Richter et al. (Eds.): ALT’98, LNAI 1501, pp. 375-^^^ 1998.
(c) Springer-Verlag Berlin Heidelberg 1998

We investigate the sample complexity for neural trees in terms of their
Vapnik-Chervonenkis (VC) dimension. It is well known that the VC dimension
of a function class gives asymptotically tight bounds on the number of training
examples needed for probably approximately correct (PAC) learning of this class.
For detailed definitions we refer the reader to the literature. Moreover, these
estimates of the sample complexity in terms of the VC dimension hold even for
agnostic PAC learning, that is, in the case when the training examples are
generated by some arbitrary probability distribution. Furthermore, the VC
dimension is known to yield bounds for the complexity of learning in various
on-line learning models.

Results on the VC dimension for neural networks abound. See, for instance,
the survey by Maass. We briefly mention the most relevant ones for this
paper. A feedforward network of threshold gates is known to have VC dimension
at most $O(w \log w)$ where w is the number of weights. Networks using piecewise
polynomial functions for their gates have VC dimension $O(w^2)$ [7] whereas
for sigmoidal networks the bound $O(w^4)$ is known [12]. With respect to lower
bounds it has been shown that there are threshold networks with VC dimension
$\Omega(w \log w)$. Furthermore, networks with VC dimension $\Omega(w^2)$ have been
exhibited [13]. Among these are networks that consist of both threshold and
linear gates, and sigmoidal networks.

Bounds on the VC dimension for neural networks are usually given in terms of
the number of programmable parameters of these networks, which are, most
commonly, the weights. In contrast to the majority of the results in the literature,
however, we are not looking at the VC dimension of a single network with a
fixed underlying graph, but of the entire class of trees employing a specified
activation function. This must be taken into account when comparing our results
with other ones.

Hancock et al. [10] have shown that the VC dimension of the class of trees
having threshold gates as nodes (they called them nonoverlapping Perceptron
networks) is $O(n \log n)$ where n is the number of inputs.^1 We take this result as
a starting point for our work. The basic definitions are introduced in Section 2. In
Section 3 we show that the class of threshold trees has VC dimension $\Omega(n \log n)$,
which is asymptotically tight. Moreover, we show that this bound remains valid
even when all trees are required to have depth two and the output gate
computes a disjunction. This lower bound is then easily transferred to trees with
sigmoidal gates. In Section 4 we provide a new calculation for an upper bound
that considerably improves the constants of [10]. Section 5 is a short note on
how to derive the upper bound $O(n^4)$ for the class of sigmoidal trees. Finally,
in Section 6 we show that adding linear gates to threshold or sigmoidal trees
cannot increase their VC dimension. Interestingly, it was this use of linear
gates that led to a quadratic lower bound for sigmoidal neural networks in the
work by Koiran and Sontag [13]. Consequently, if the lower bound $\Omega(n \log n)$
is not tight for sigmoidal trees, one has to look for new techniques in the search
for asymptotically better bounds.

^1 Since a neural tree on n inputs has $O(n)$ weights, it is convenient to formulate the
bounds in terms of the number of inputs. We follow this convention throughout the
paper.



2 Basic Definitions 

A neural tree is a feedforward neural network where the connectivity, or
architecture, of the network is a tree, that is, there is at most one edge outgoing
from each node. Further, there is exactly one node, the root of the tree or output
node, that has no edge outgoing. A neural tree on n inputs has n leaves, also
called input nodes. The depth of a tree is the length of the longest path from an
input node to the output node. The nodes that are not leaves are also known
as computation nodes. Each computation node has associated with itself a set of
$k + 1$ real-valued parameters where k is the in-degree of the node: the weights
$w_1, \ldots, w_k$ and the threshold t.

We use trees for computations over the reals by assigning functions to their
computation nodes and values to their parameters. We consider three types of
functions that the nodes may use. All types can be obtained by applying a
so-called activation function to the weighted sum $w_1 x_1 + \cdots + w_k x_k - t$ where
$x_1, \ldots, x_k$ are the input values for the node computed by its predecessors. (The
values computed by the input nodes are the input values for the tree.) A node
becomes a threshold gate when it uses the signum function with $\mathrm{sign}(y) = 1$
if $y \ge 0$, and $\mathrm{sign}(y) = 0$ otherwise. A sigmoidal gate is a node that uses the
sigmoidal function $1/(1 + e^{-y})$. Finally, a linear gate applies the identity function,
that is, it simply outputs the weighted sum.
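The three gate types and the evaluation of a tree can be sketched in a few lines of Python. This is an illustration under our own naming (`Gate`, `sign`, `sigmoid`, `identity` are not from the paper), and it assumes sign(0) = 1, which is the convention needed for the gates with weights 1 and threshold 2 used later in the proof of Theorem 1.

```python
import math


def sign(y):
    """Threshold activation: 1 if y >= 0, else 0."""
    return 1 if y >= 0 else 0


def sigmoid(y):
    """Sigmoidal activation 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def identity(y):
    """Linear gate: output the weighted sum itself."""
    return y


class Gate:
    """A computation node; each entry of `inputs` is an input index or a Gate."""

    def __init__(self, activation, weights, threshold, inputs):
        self.activation = activation
        self.weights = weights
        self.threshold = threshold
        self.inputs = inputs

    def output(self, x):
        values = [child.output(x) if isinstance(child, Gate) else x[child]
                  for child in self.inputs]
        weighted = sum(w * v for w, v in zip(self.weights, values))
        return self.activation(weighted - self.threshold)


# A depth-two threshold tree on 4 inputs: (x0 AND x1) OR (x2 AND x3).
left = Gate(sign, [1, 1], 2, [0, 1])
right = Gate(sign, [1, 1], 2, [2, 3])
tree = Gate(sign, [1, 1], 1, [left, right])
```

Every input index and every hidden gate appears as the child of exactly one gate, so the structure is a tree in the sense defined above.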

We say that a neural tree is a threshold tree if all its computation nodes are
threshold gates. Correspondingly, we speak of sigmoidal trees and linear trees.
If we allow more than one type of activation function for a tree then we shall
assume that each of its computation nodes may use all types specified. Since
we restrict our investigations to trees that compute $\{0, 1\}$-valued functions, we
assume the output of a tree to be thresholded at 1/2 if the output node is a
linear or sigmoidal gate. Thus we can associate with each tree on n inputs a
set of functions from $\mathbb{R}^n$ to $\{0, 1\}$ which are obtained by choosing activation
functions for its nodes and varying its parameters over the reals. In a class of
trees all members have the same number of inputs, denoted by n, and choose
gate functions for their nodes from a specified set, which can be of the three
types introduced above. The set of functions computed by a class of trees is then
defined in the straightforward way as the union of the sets computed by its members.

A dichotomy of a set $S \subseteq \mathbb{R}^n$ is a partition of S into two disjoint subsets
$S_0, S_1$ such that $S_0 \cup S_1 = S$. Given a set $\mathcal{F}$ of functions from $\mathbb{R}^n$ to $\{0, 1\}$ and a
dichotomy $S_0, S_1$ of S, we say that $\mathcal{F}$ induces the dichotomy $S_0, S_1$ on S if there
is a function $f \in \mathcal{F}$ such that $f(S_0) \subseteq \{0\}$ and $f(S_1) \subseteq \{1\}$. We say further that
$\mathcal{F}$ shatters S if $\mathcal{F}$ induces all dichotomies on S. The Vapnik-Chervonenkis (VC)
dimension of $\mathcal{F}$, $\mathrm{VCdim}(\mathcal{F})$, is defined as the largest number m such that there
is a set of m elements that is shattered by $\mathcal{F}$.
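For a finite function class over a finite domain, the definitions of shattering and VC dimension can be checked by brute force. The sketch below is only an illustration of the definitions (the helper names `shatters` and `vc_dimension` are ours); the classes studied in this paper are infinite and are not handled this way.

```python
from itertools import combinations


def shatters(functions, points):
    """True if the functions induce all 2^|points| dichotomies on the points."""
    patterns = {tuple(f(p) for p in points) for f in functions}
    return len(patterns) == 2 ** len(points)


def vc_dimension(functions, domain):
    """Largest size of a subset of `domain` shattered by `functions`."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(shatters(functions, subset)
               for subset in combinations(domain, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best


# Two small classes on the points 1, 2, 3 of the real line.
cuts = [0.5, 1.5, 2.5, 3.5]
thresholds = [lambda x, t=t: int(x >= t) for t in cuts]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in cuts for b in cuts if a < b]
```

On this domain the one-sided thresholds shatter single points but no pair, while the intervals shatter pairs but cannot produce the labeling 1, 0, 1 on three points.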

3 Lower Bounds on the VC Dimension for Threshold and 
Sigmoidal Trees 

In this section we consider neural trees consisting of threshold and sigmoidal 
gates. We first establish a lower bound for the VC dimension for a class of 
threshold trees with certain restrictions: We assume that each tree has only one 
layer of hidden nodes and that the output node computes a disjunction. 

Theorem 1. For each $m, k \ge 1$ there exists a set $S \subseteq \{0,1\}^{m + k \cdot 2^{k-1}}$ of
cardinality $|S| = m \cdot k$ that is shattered by the class of depth-two threshold trees with
disjunctions as output gates.

Proof. Let the set $S \subseteq \{0,1\}^{m + k \cdot 2^{k-1}}$ be defined as

$$S = \{e_i : i = 1, \ldots, m\} \times \{d_j : j = 1, \ldots, k\}$$

where $e_i \in \{0,1\}^m$ is the i-th unit vector and $d_j \in \{0,1\}^{k \cdot 2^{k-1}}$ is specified
as follows: Let $A_1, \ldots, A_{2^k}$ be an enumeration of all subsets of $\{1, \ldots, k\}$. We
arrange the $k \cdot 2^{k-1}$ components of $d_j$ in $2^k$ blocks $b_1, \ldots, b_{2^k}$ such that block $b_l$
has length $|A_l|$ for $l = 1, \ldots, 2^k$. (Hence the block corresponding to the empty
set has length 0.) Thus all blocks together comprise exactly $k \cdot 2^{k-1}$ components.

Fix $j \in \{1, \ldots, k\}$, $l \in \{1, \ldots, 2^k\}$ and consider the components of block $b_l$
in $d_j$. Each of these components is assumed to represent one specific element of
$A_l$. For such an element $a \in A_l$ let $d_j(b_l, a)$ denote the value of this component.
We define

$$d_j(b_l, a) = \begin{cases} 1 & \text{if } a = j, \\ 0 & \text{otherwise.} \end{cases}$$

Thus we proceed for $j = 1, \ldots, k$ and $l = 1, \ldots, 2^k$. Observe that due to this
construction, each block $b_l$ contains at most one 1. (Furthermore, the number of
1s in $d_j$ is equal to $2^{k-1}$ since j occurs in exactly half of all subsets $A_l$.)

Obviously, S consists of $m \cdot k$ elements. We claim that S is shattered by the
class of depth-two threshold trees having a disjunction as output gate. In order
to prove this we show that for each $S' \subseteq S$ there are weights and thresholds
for such a tree such that this network outputs 1 for elements in $S'$, and 0 for
elements in $S \setminus S'$. Fix $S' \subseteq S$. For $i = 1, \ldots, m$ let $\alpha(i)$ be the (unique) element
in $\{1, \ldots, 2^k\}$ such that

$$A_{\alpha(i)} = \{ j : e_i d_j \in S' \} .$$

For convenience we call the input nodes $1, \ldots, m$ the e-inputs, and the input
nodes $m + 1, \ldots, m + k \cdot 2^{k-1}$ the d-inputs. We employ for each element in
the range of $\alpha$ a threshold gate $G_{\alpha(i)}$ that has connections from all d-inputs in
block $b_{\alpha(i)}$ and from none of the other blocks $b_l$ where $l \neq \alpha(i)$. Further, we
connect e-input i to gate $G_{\alpha(i)}$ for $i = 1, \ldots, m$. (Notice that this may result
in gates that have connections from more than one e-input.) The weights of all
connections are fixed to 1 and the thresholds are set to 2. Obviously, there is at
most one connection outgoing from each input node, so that the disjunction of
these threshold gates is a tree.

Finally, we verify that the network computes the desired function on S.
Suppose that $x \in S'$ where $x = e_i d_j$. The definition of $\alpha$ implies that $j \in A_{\alpha(i)}$.
Hence $d_j(b_{\alpha(i)}, j)$ is defined and has value $d_j(b_{\alpha(i)}, j) = 1$. Since gate $G_{\alpha(i)}$
receives two 1s (one from e-input i and one from block $b_{\alpha(i)}$), the output of the
network for $e_i d_j$ is 1.

Assume on the other hand that $e_i d_j \in S \setminus S'$. Then $j \notin A_{\alpha(i)}$ and gate $G_{\alpha(i)}$,
which is the only gate that receives a 1 from an e-input, receives only 0s from
block $b_{\alpha(i)}$. All other gates $G_l$, where $l \neq \alpha(i)$, receive at most one 1, which is
then from block $b_l$ only. Hence, the output of the network for $e_i d_j$ is 0. □
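The construction in the proof of Theorem 1 is small enough to verify exhaustively for small m and k. The following Python sketch builds the points $e_i d_j$ and, for every subset S' of S, the depth-two tree described above, and checks that the tree outputs 1 exactly on S'. The function names are ours, subsets are enumerated by bitmask (so blocks are indexed from 0), and sign(0) = 1 is assumed.

```python
from itertools import product


def build_points(m, k):
    """Return the subsets A_l, the d-input slots, and the points e_i d_j."""
    # A_0, ..., A_{2^k - 1}: all subsets of {1, ..., k}, one per bitmask.
    subsets = [[a for a in range(1, k + 1) if (mask >> (a - 1)) & 1]
               for mask in range(2 ** k)]
    # One d-input per pair (block l, element a of A_l), numbered in order.
    slot = {}
    for l, block in enumerate(subsets):
        for a in block:
            slot[(l, a)] = len(slot)
    points = {}
    for i in range(1, m + 1):
        e = [1 if t == i else 0 for t in range(1, m + 1)]
        for j in range(1, k + 1):
            # Component (l, a) of d_j is 1 exactly when a == j.
            d = [1 if a == j else 0 for (_, a) in slot]
            points[(i, j)] = e + d
    return subsets, slot, points


def tree_output(x, m, subsets, slot, alpha):
    """Disjunction of gates G_l, each with all weights 1 and threshold 2."""
    for l in set(alpha.values()):
        total = sum(x[i - 1] for i in alpha if alpha[i] == l)    # e-inputs
        total += sum(x[m + slot[(l, a)]] for a in subsets[l])    # block b_l
        if total - 2 >= 0:
            return 1
    return 0


def is_shattered(m, k):
    """Check every dichotomy of S against the tree built for it."""
    subsets, slot, points = build_points(m, k)
    keys = list(points)
    for bits in product([0, 1], repeat=len(keys)):
        chosen = {key for key, bit in zip(keys, bits) if bit}
        alpha = {i: subsets.index([j for j in range(1, k + 1)
                                   if (i, j) in chosen])
                 for i in range(1, m + 1)}
        for key in keys:
            if tree_output(points[key], m, subsets, slot, alpha) != int(key in chosen):
                return False
    return True
```

Because the check enumerates all $2^{mk}$ dichotomies, it is only practical for very small parameters such as `is_shattered(2, 3)`.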



Choosing $m = n/2$ and $k = (\log(n/2))/2 + 1$ in Theorem 1 we have
$m + k \cdot 2^{k-1} \le n/2 + n/2 = n$. Hence there is a set $S \subseteq \{0,1\}^n$ of cardinality
$m \cdot k = \Omega(n \log n)$ that is shattered by the class of trees considered.

Corollary 2. The VC dimension of the class of threshold trees on n inputs is
$\Omega(n \log n)$. This even holds if all input values are binary and the class is restricted
to trees of depth two with a disjunction as output gate.

It is well known that in a network that computes a Boolean function, a
threshold gate can be replaced by a sigmoidal gate without changing the function
of the network. (If necessary, the weights have to be scaled appropriately. See, for
instance, Maass et al. for a treatment in the context of circuit complexity.)
Thus, the lower bound $\Omega(n \log n)$ also holds for depth-two trees that may consist
of threshold or sigmoidal gates.

Corollary 3. The class of depth-two trees on n inputs with threshold or
sigmoidal gates has VC dimension $\Omega(n \log n)$. This even holds if the inputs are
restricted to binary values.

We note that depth two is minimal for this lower bound since a threshold
gate and a sigmoidal gate both have VC dimension $n + 1$: This follows for the
threshold gate from a bound on the number of regions generated by a certain
number of hyperplanes in $\mathbb{R}^n$ which is due to Schläfli. For the
sigmoidal gate this follows from the fact that its pseudo dimension is $n + 1$.

Together with the upper bound $O(n \log n)$ due to Hancock et al. [10] we get
asymptotically tight bounds for the class of threshold trees.

Corollary 4. The VC dimension of the class of threshold trees on n inputs is
$\Theta(n \log n)$. This holds also for the class of depth-two threshold trees. Moreover,
this bound is valid for both classes if the inputs are restricted to binary values.

4 Improved Upper Bounds for Threshold Trees



In this section we establish upper bounds for the VC dimension of the class of
threshold trees. Regarding the constants they are better than a previous bound
derived by Hancock et al. [10], which is $13n\log(2en) + 4n\log\log(4n)$.

Theorem 5. The class of threshold trees on n inputs, where $n \ge 16e$, has VC
dimension at most $6n\log(\sqrt{3}\,n)$.



Proof. We estimate the number of dichotomies that are induced by the class of
threshold trees on an arbitrary set of cardinality m. First we bound the number
of these trees, then we calculate an upper bound on the number of dichotomies
that a single tree induces when all its weights and thresholds are varied.

For the number of trees on n inputs we use the upper bound $(4n)^{n-1}$, which
was derived in [10, Lemma 3]. We assume without loss of generality that the
computation nodes at the lowest level (i.e., those nodes that have input nodes
as predecessors) have in-degree 1 and that all other computation nodes have
in-degree at least 2. Each of the computation nodes at the lowest level induces at
most 2m dichotomies on a set of cardinality m. The whole level induces therefore
at most $(2m)^n$ different functions. The computation nodes with in-degree at least
2 form a tree that consists of at most $n - 1$ nodes and has at most $2n - 2$ edges
leading to one of these nodes.

According to a result by Shawe-Taylor^2 [21] the number of dichotomies that
a threshold network with N computation nodes, partitioned into $\nu$ equivalence
classes, and W edges induces on a set of cardinality m is at most

$$2^{\nu} \left( \frac{emN}{W - \nu} \right)^{W - \nu} .$$

Using $N = n - 1$, $\nu = n - 1$, and $W = 2n - 2$ we get that the number of
dichotomies induced by a threshold tree consisting of $n - 1$ computation nodes
and $2n - 2$ edges is at most $2^{n-1}(em)^{n-1}$.

Putting the bounds together, the total number of dichotomies induced on a
set of cardinality m by the class of threshold trees on n inputs is at most

$$(4n)^{n-1} \cdot (2m)^n \cdot 2^{n-1}(em)^{n-1} .$$



Assume now that a set of cardinality m is shattered. Then

$$2^m \le (4n)^{n-1} \cdot (2m)^n \cdot 2^{n-1}(em)^{n-1} = 2(16en)^{n-1} \cdot m^{2n-1} \le 2(mn)^{2n-1} .$$

For the last inequality we have used the assumption $n \ge 16e$. Taking logarithms
on both sides we obtain

$$m \le (2n - 1)\log(mn) + 1 .$$

We weaken this to

$$m \le 2n\log(mn) . \qquad (1)$$

Assume without loss of generality that $m \ge \log n$. Then it is easy to see that
for each such m there is a real number $r \ge 1$ such that m can be written as
$m = r\log(rn)$. Substituting this in (1) yields

$$r\log(rn) \le 2n\log(rn\log(rn))$$
$$= 2n(\log(rn) + \log\log(rn)) \qquad (2)$$
$$\le 3n\log(rn) . \qquad (3)$$

The last inequality follows from $\log(rn) \le \sqrt{rn}$, which holds since $rn \ge 16e$. We
divide both sides by $\log(rn)$ and get

$$r \le 3n . \qquad (4)$$

This implies

$$r\log(rn) \le 3n\log(3n^2) .$$

Resubstituting $m = r\log(rn)$ for the left hand side and rearranging the right
hand side yields

$$m \le 6n\log(\sqrt{3}\,n)$$

as claimed. □

^2 We do not make use of the equivalence relations involved in this result but of the
improvement that it achieves over the earlier bound.
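The last step of the proof can be checked numerically: every m that satisfies inequality (1) should lie below the bound of Theorem 5. The sketch below assumes that log denotes the logarithm to base 2 (natural here, since the derivation starts from $2^m$) and scans integers m up to twice the claimed bound; the function names are ours.

```python
import math


def satisfies_inequality_1(m, n):
    """Inequality (1): m <= 2 n log(m n), with log taken to base 2."""
    return m <= 2 * n * math.log2(m * n)


def theorem5_bound(n):
    """The bound 6 n log(sqrt(3) n) of Theorem 5, with log to base 2."""
    return 6 * n * math.log2(math.sqrt(3) * n)


def largest_admissible_m(n):
    """Largest integer m up to twice the bound that satisfies (1)."""
    limit = 2 * math.ceil(theorem5_bound(n))
    return max(m for m in range(1, limit + 1) if satisfies_inequality_1(m, n))
```

For admissible n (the theorem requires n to be at least 16e, that is, n of 44 or more), comparing `largest_admissible_m(n)` with `theorem5_bound(n)` shows how much slack the steps from (2) to (4) leave.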



In the statement of Theorem 5 the number n of inputs is required to satisfy
$n \ge 16e$. We shall show now that we can get the upper bound as close to
$4n\log(\sqrt{2}\,n)$ as we want provided that n is large enough.

Theorem 6. For each $\varepsilon > 0$, the class of threshold trees on n inputs has VC
dimension at most $4(1 + \varepsilon)n\log(\sqrt{2(1 + \varepsilon)}\,n)$ for all sufficiently large n.

Proof. (Sketch) Fix $\varepsilon > 0$. For n sufficiently large we have $\log(rn) \le (rn)^{\varepsilon}$.
Using this in the inequality from (2) to (3) we can infer $r \le 2(1 + \varepsilon)n$ in place
of (4). This leads then to the claimed result. □

5 A Note on the Upper Bound for Sigmoidal Trees 

Using known results on the VC dimension it is straightforward to derive the
upper bound $O(n^4)$ for sigmoidal trees. We give a brief account.

Proposition 7. The class of sigmoidal trees on n inputs has VC dimension
$O(n^4)$.

Proof. The VC dimension of a sigmoidal neural network with w weights is $O(w^4)$.
This has been shown by Karpinski and Macintyre [12]. By Sauer’s Lemma
the number of dichotomies induced by a class of functions with VC
dimension $d \ge 2$ on a set of cardinality $m \ge 2$ can be bounded by $m^d$. Thus a
sigmoidal tree on n inputs induces at most $m^{O(n^4)}$ dichotomies on such a set.
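Sauer's Lemma bounds the number of dichotomies by the sum of the binomial coefficients $\binom{m}{i}$ for $i = 0, \ldots, d$, and the proof uses the simpler estimate $m^d$. A short Python check of this simplification for small values (the function name `sauer_bound` is ours):

```python
from math import comb


def sauer_bound(m: int, d: int) -> int:
    """Sum of C(m, i) for i = 0..d, the bound given by Sauer's Lemma."""
    return sum(comb(m, i) for i in range(d + 1))
```

For example, `sauer_bound(3, 2)` is 1 + 3 + 3 = 7, which is below $3^2 = 9$; the two quantities coincide for m = d = 2.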

Combining this with the bound $(4n)^{n-1}$ employed in the proof of Theorem 5
and using similar arguments we obtain the bound as claimed. □

6 Trees with Linear Gates

By the work of Goldberg and Jerrum [7] it has been known that neural networks
employing piecewise polynomial activation functions have VC dimension $O(w^2)$,
where w is the number of weights. The question whether this bound is tight for
such networks has been settled by Koiran and Sontag [13]. They have shown
that networks consisting of threshold and linear gates can have VC dimension
$\Omega(w^2)$. This result was somewhat unexpected since networks consisting of linear
gates only compute linear functions and have therefore VC dimension $O(w)$. On
the other hand, networks consisting of threshold gates only have VC dimension
$O(w \log w)$. This follows from work by Cover and has also been shown by
Baum and Haussler. Results that this bound is tight for threshold networks
are due to Sakurai and Maass.

Therefore, the question arises whether a similar increase of the VC dimension 
is possible for threshold or sigmoidal trees by allowing some of the nodes to 
compute linear functions. We show now that this cannot happen. 

Theorem 8. Let $T$ be a class of neural trees consisting of threshold or sigmoidal
gates and let $T^{\mathrm{lin}}$ be a class of trees obtained from $T$ by replacing some of the
gates by linear gates. Then $\mathrm{VCdim}(T) \geq \mathrm{VCdim}(T^{\mathrm{lin}})$.

Proof. We show that nodes computing linear functions can be replaced or
eliminated without changing the function of the tree. Assume T is a tree and y is
a node in T computing a linear function. If y is the output node then y can be
replaced by a threshold gate or a sigmoidal gate, where weights and threshold
are modified if necessary. (Note that the output of the tree is thresholded at 1/2
for linear and sigmoidal output gates.)

If $y$ is a hidden node then there is a unique edge $e$ outgoing from $y$ to its
successor $z$. Denote the weight of $e$ by $w$. Assume that $y$ computes the function
$u_1 x_1 + \cdots + u_k x_k - t$ where $x_1, \ldots, x_k$ are the predecessors of $y$, and
$u_1, \ldots, u_k, t$ are its weights and threshold. We delete node $y$ and edge $e$, and
introduce $k$ edges from $x_1, \ldots, x_k$ respectively to $z$. We assign weight $w u_i$ to
the edge from $x_i$ for $i = 1, \ldots, k$ and increase the threshold of $z$ by $wt$, since
$y$ contributes $w(u_1 x_1 + \cdots + u_k x_k) - wt$ to the weighted sum of $z$. It can
readily be seen that the resulting network is still a tree and computes the same
function as T.
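
The splicing step can be illustrated on a single gate. This is a minimal sketch under stated assumptions: every gate computes its weighted sum minus its threshold, the successor $z$ is a threshold gate, and all weights are integers so that the comparison is exact. The function names and the random test harness are hypothetical. Under this convention the constant $-wt$ contributed by $y$ is absorbed by raising the threshold of $z$ to $t_z + wt$.

```python
import random

def threshold_gate(weights, threshold, inputs):
    """Threshold gate: 1 if sum(w_i * x_i) - threshold >= 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total - threshold >= 0 else 0

def z_with_linear_child(x, u, t, w, v, tz):
    """Successor z fed by the linear node y (edge weight w) and other inputs.

    y = u . x[:k] - t is the hidden linear node; v weights the remaining inputs.
    """
    k = len(u)
    y = sum(ui * xi for ui, xi in zip(u, x[:k])) - t
    return threshold_gate([w] + v, tz, [y] + x[k:])

def z_with_child_spliced(x, u, t, w, v, tz):
    """Same gate after deleting y: weights w*u_i on x_1..x_k, threshold tz + w*t."""
    return threshold_gate([w * ui for ui in u] + v, tz + w * t, x)

random.seed(0)
agree = True
for _ in range(1000):
    k, r = 3, 2  # k predecessors of y, r further inputs of z
    u = [random.randint(-5, 5) for _ in range(k)]
    v = [random.randint(-5, 5) for _ in range(r)]
    t, w, tz = (random.randint(-5, 5) for _ in range(3))
    x = [random.randint(0, 1) for _ in range(k + r)]
    agree = agree and (z_with_linear_child(x, u, t, w, v, tz)
                       == z_with_child_spliced(x, u, t, w, v, tz))
print(agree)
```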

□ 

Combining Theorems E]and0 with Theorem 0 we obtain an upper bound for 
the class of trees with threshold or linear gates. 

Corollary 9. The class of neural trees on $n$ inputs with threshold or linear
gates has VC dimension at most $6n \log(\sqrt{3}\,n)$ for $n > 16e$. Furthermore, for
each $\epsilon > 0$, this class has VC dimension at most
$4(1 + \epsilon)n \log(\sqrt{2(1 + \epsilon)}\,n)$ for all sufficiently large $n$.

The technique used in the proof can also be applied to neural trees employing
a much wider class of gates. If the function computed by a gate can be decomposed
into a linear and a non-linear part then the method of deleting a hidden
linear node works the same way. Only if the node to be treated is the output
node must its function meet some further demands. For instance, if the
non-linear part of this function is monotone then a linear output node can be
replaced by such a gate without decreasing the VC dimension of the tree.



7 Conclusions 

Finding methods that incorporate prior knowledge into learning algorithms is 
an active research area in theoretical and applied machine learning. In the case 
of neural learning algorithms such knowledge might be reflected in a restricted 
connectivity of the network generated by the algorithm. We have studied the 
impact of a particular kind of such a restriction on the sample complexity for 
neural networks. Results were given in terms of bounds for the VC dimension. 

We have established the asymptotically tight bound $\Theta(n \log n)$ for the class
of threshold trees. We have also derived an improved upper bound for this class.
Due to our result demonstrating that the use of linear gates in threshold trees
cannot increase their VC dimension, a known technique to construct networks
with quadratic VC dimension does not work for trees. As a consequence of this,
the gap between the currently best known lower and upper bounds for the class
of sigmoidal trees, which are $\Omega(n \log n)$ and $O(n^4)$, is larger than it is for
sigmoidal networks. To reduce this gap and to extend these investigations to other
frequently used types of gates are challenging open problems for future research.

Acknowledgement. I thank an anonymous referee for helpful comments lea- 
ding to a clarification in the proof of Theorem Q] 

Abstract. In this paper, we give learning algorithms for two new subclasses
of DNF formulas: poly-disjoint One-read-once Monotone DNF; and
Read-once Factorable Monotone DNF, which is a generalization of Read-once
Monotone DNF formulas. Our result uses Fourier analysis to construct
the terms of the target formula based on the Fourier coefficients
corresponding to these terms. To facilitate this result, we give a novel
theorem on the approximation of Read-once Factorable Monotone DNF
formulas, in which we show that if a set of terms of the target formula
have polynomially small mutually disjoint satisfying sets, then the set
of terms can be approximated with small error by the greatest common
factor of the set of terms. This approximation theorem may be of
independent interest.



1 Introduction and Previous Work 

Since the inception of computational learning theory in the PAC (Probably Ap- 
proximately Correct) learning model due to Valiant [Val 84], the problem of the 
learnability of DNF has received much attention. One of the reasons for this is 
the potential of DNF as a form of knowledge representation, with applications in 
expert systems and data mining. DNF is also interesting in that it appears to be 
near the boundary of learnability. Learning general Boolean formulas and
log-depth circuits is known to be as hard as factoring [KV 88]. The results of Lund
and Yannakakis [LY 93] on the hardness of approximating the k-coloring
problem and the results of Pitt and Valiant [PV 86] reducing the coloring problem
to the DNF learning problem show that s-term DNF formulas are not learnable
by s-term DNF hypotheses for some $\epsilon$, unless NP = RP. In contrast, several
sub-classes of DNF formulas are known to be learnable. For an excellent survey 
of the DNF learning problem, we refer the reader to [AP 95] . 

Due to the apparent difficulty of learning DNF for arbitrary distributions,
research efforts have focused on learning this class for specific distributions,
in particular for the uniform distribution. The first distribution-specific results
were given by Kearns, Li, Pitt and Valiant [KLPV 87, KLV 94] for learning
μ-DNF formulas, in which every attribute occurs at most once in the formula, on
the uniform distribution. Hancock [H 92] further studied restricted-read DNF,
and gave polynomial time algorithms for learning kμ-DNF, where each attribute
occurs at most k times in the formula, on the uniform distribution.

There have been several positive results for the learnability of DNF on the
uniform distribution, although no polynomial time algorithms are known. Linial,
Mansour and Nisan [LMN 89] show how to learn $AC^0$ circuits on the uniform
distribution by learning the Fourier coefficients in quasi-polynomial time. In
[Ver 90a], we show that DNF is learnable under the uniform distribution with a
similar time bound, but for which the output hypothesis is a DNF formula.

In the membership query learning model, Mansour [Man 95] showed that
DNF can be learned on the uniform distribution in time $n^{O(\log \log n)}$. A
polynomial time algorithm is given by Khardon [Khar 94] for learning
Disjoint-DNF, where every example satisfies at most one term. In a breakthrough result,
Jackson [J 94] showed that DNF is learnable in polynomial time on the uniform
distribution when membership queries are allowed. It remains an open question
whether Monotone DNF is learnable on the uniform distribution in polynomial
time using only examples.

The learnability of Read-once Boolean formulas has been studied by several
authors in the membership query and equivalence query models. Angluin,
Hellerstein and Karpinski [AHK 93] show that Monotone Read-once formulas can
be learned from membership queries alone, and that Read-once formulas can be
learned using membership and equivalence queries. In the PAC-learning model,
Goldman, Kearns and Schapire show in [GKS 90] that Read-once formulas can
be learned on the uniform distribution with polynomial time and sample
complexity. This result is generalized to product distributions in [S 92], again
with polynomial time and sample complexity.

In this paper, we introduce the classes of One-read-once Monotone DNF 
formulas, and Read-once Factorable Monotone DNF formulas. We give positive
learnability results for poly-disjoint One-read-once Monotone DNF, and
for Read-once Factorable Monotone DNF on the uniform distribution. The class 
of Read-once Factorable Monotone DNF is a superclass of Read-once Monotone 
DNF, but a sub-class of Read-once Monotone Boolean formulas. Thus, the learn- 
ability of this class is implied by [GKS 90] and [S 92]; however, the complexity 
of our algorithm is of a lower order: time and sample complexity O(n^). This 
complexity is not directly comparable to the algorithms of [GKS 90], since the 
complexity of their algorithm is in terms of n, and ours is in terms of s. However, 
our results give a lower complexity for small formulas. The class of
One-read-once Monotone DNF is also a generalization of Read-once Monotone DNF, but
since the algorithms we give here work only for poly-disjoint formulas from this 
class, this result is incomparable to the results for Read-once formulas discussed 
above. 

For the results of this paper, we use spectral analysis to show the correspon- 
dence between the Fourier coefficients of terms of a One-read-once Monotone 
DNF formula and the probability weight of the set of vectors that satisfy exac- 
tly one term of the formula. 

2 Definitions and Terminology

2.1 Functions and Classes 

Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of Boolean attributes in the learning domain.
A Boolean function is a function $f : \{0,1\}^n \to \{-1,+1\}$, where $-1$ represents
“false”, and $+1$ represents “true”. A (monotone) term $t_i$ is a conjunction of
attributes in $X$, none of which appear negated. Let $m_i$ denote the set of indices
of attributes in $t_i$, and let function $t$ map the sets of indices onto terms; thus
$t_i = t(m_i)$. Let function $m$ be the inverse of $t$, mapping a term onto the set
of indices of attributes in the term. Thus $m(t_i) = m_i$. We use set notation for
terms where the context is clear; for example, $x_j \in t_i$ to imply that $j \in m_i$. For
$x \in \{0,1\}^n$, and a term $t_i$, we use $x \Rightarrow t_i$ to indicate that vector $x$ satisfies $t_i$.
A cross-term of a formula $f$ is a term that contains attributes from more than
one term of $f$. Let $S(t_i) = \{x \mid x \Rightarrow t_i\}$, the satisfying set of $t_i$, be the set of
vectors that satisfy term $t_i$. We also refer to the satisfying set of a formula $f$,
$S(f)$, as the set of vectors that satisfy $f$. Let $DS(t_i)$, the disjoint satisfying set
of $t_i$, denote the set of vectors that satisfy term $t_i$, but do not satisfy any other
term $t_j$, $j \neq i$.

A Boolean formula $f$ is a Monotone DNF (hereafter MDNF) formula if $f$ is
of the form $f = t_1 + \cdots + t_s$, where each $t_i$ is a monotone term. The size of
the formula is $s$, the number of terms in the formula. A formula $f$ is read-once
if no variable appears more than once in $f$. An attribute $x_i$ is read-once if it
occurs exactly once in $f$. A formula $f$ is one-read-once if for every term of $f$, at
least one attribute is read-once.
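
These definitions can be made concrete by brute-force enumeration on a small formula. The sketch below assumes a representation that is not in the paper: a term is a set of 1-based attribute indices and a formula is a list of such sets. All function names are illustrative.

```python
from itertools import product

def satisfies(x, term):
    """x => t: every attribute of the term is set to 1 in x."""
    return all(x[i - 1] == 1 for i in term)

def satisfying_set(term, n):
    """S(t): all vectors of {0,1}^n that satisfy the term."""
    return {x for x in product((0, 1), repeat=n) if satisfies(x, term)}

def disjoint_satisfying_set(terms, i, n):
    """DS(t_i): vectors satisfying terms[i] and no other term."""
    others = [t for j, t in enumerate(terms) if j != i]
    return {x for x in satisfying_set(terms[i], n)
            if not any(satisfies(x, t) for t in others)}

def occurrence_counts(terms):
    """Number of terms in which each attribute occurs."""
    counts = {}
    for t in terms:
        for a in t:
            counts[a] = counts.get(a, 0) + 1
    return counts

def is_read_once(terms):
    """No attribute appears more than once in the formula."""
    return all(c == 1 for c in occurrence_counts(terms).values())

def is_one_read_once(terms):
    """Every term contains at least one attribute occurring exactly once."""
    once = {a for a, c in occurrence_counts(terms).items() if c == 1}
    return all(set(t) & once for t in terms)

# f = x1 x2 + x2 x3 on n = 3 attributes
F = [frozenset({1, 2}), frozenset({2, 3})]
print(disjoint_satisfying_set(F, 0, 3), is_read_once(F), is_one_read_once(F))
```

Here $x_2$ occurs twice, so the formula is not read-once, but $x_1$ and $x_3$ make it one-read-once.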

For a term $t$ and a formula $f = t_1 + \cdots + t_s$, we define the restriction of $f$
to $t$, $f_t$, by $f_t = t \cdot t_1 + \cdots + t \cdot t_s$. For a set of terms $T$ we define the
greatest common factor of $T$ to be the largest term $t$ that is contained in every
term in $T$.

We define the factorization of an MDNF formula recursively. As the base case,
a read-once formula is a factorization (with the trivial factor of the empty set of
attributes). Any formula that is formed as the sum of products of terms (factors)
and a factorization of an MDNF formula is also a factorization. A Monotone DNF
formula $f$ is said to be read-once factorable if there exists a factorization of $f$
such that no attribute occurs more than once in the factorization. For example, for
the formula $f = x_1x_2 + x_1x_3 + x_2x_3$, a factorization of $f$ is
$f = x_1(x_2 + x_3) + x_2(x_1 + x_3) + x_3(x_1 + x_2)$. Such a representation is called a
factored form of $f$. This formula is not, however, a read-once factorization of $f$,
and indeed, no read-once factorization exists for $f$. However, the formula
$g = x_1x_2x_3 + x_1x_2x_4 + x_1x_5x_6$ is read-once factorable, and
$g = x_1(x_2(x_3 + x_4) + x_5x_6)$ is a factorization. The
set of maximal factors of a formula $f$ is the set of maximal greatest common
factors over all subsets of terms in $f$ (i.e., the set of terms $t$ such that for some
subset $T$ of the terms in $f$, $t$ is contained in all terms in $T$, and $t$ is the largest
such factor). Formula $g$ above has maximal factor set $\{x_1, x_1x_2\}$.
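
The maximal factor set of the example can be recomputed mechanically. This is a minimal sketch under two assumptions that the text leaves implicit: only subsets of at least two terms are considered, and empty common factors are discarded. Terms are sets of attribute indices and the function names are illustrative.

```python
from itertools import combinations

def greatest_common_factor(terms):
    """Largest term contained in every term: the intersection of the index sets."""
    return frozenset.intersection(*terms)

def maximal_factors(terms):
    """Non-empty greatest common factors over all subsets of two or more terms."""
    factors = set()
    for size in range(2, len(terms) + 1):
        for subset in combinations(terms, size):
            common = greatest_common_factor(subset)
            if common:
                factors.add(common)
    return factors

# g = x1 x2 x3 + x1 x2 x4 + x1 x5 x6
G = [frozenset({1, 2, 3}), frozenset({1, 2, 4}), frozenset({1, 5, 6})]
print(maximal_factors(G))
```

For this $g$ the result consists of the index sets of $x_1$ and $x_1x_2$, matching the set stated above.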

Each of Monotone DNF, Read-once MDNF, One-read-once MDNF and Read- 
once Factorable MDNF is referred to as a class of formulas. For each of the classes 
discussed above, we may use the qualifier “poly-disjoint” to refer to the set of all
formulas in the class for which every term $t_i$ in the formula has $\Pr_D[DS(t_i)]$
bounded below by a quantity given in terms of $p$, $s$ and $\epsilon$, where $p$ is a
polynomial, $s$ is the size of the formula, $0 < \epsilon < 1$, and $D$ is a probability
distribution.



2.2 Learnability 

We use the standard definitions for PAC-learnability in this paper. We assume
that the reader is familiar with these (see [HKLW 88] for an excellent
description). Here, we give only the following definitions. Let $D^+$ denote the uniform
distribution over the positive examples of the target formula $f$, $D^-$ be the
uniform distribution over the negative examples, and $D$ be the uniform distribution
over the entire example space, $\{0,1\}^n$. Let the positive error, $e^+(h)$, of hypothesis
$h$ with respect to the target formula $f$ be the probability that $h$ misclassifies
a positive example drawn according to distribution $D^+$, and similarly for $e^-(h)$.
Let $e(h) = e^+(h) + e^-(h)$. For a hypothesis $h$, and for $\alpha > 0$, we say that $h$
is an $\alpha$-good hypothesis, or $\alpha$-approximate hypothesis, if $e(h) < \alpha$. Let
$\Pr[f \triangle g]$ denote the probability that $f \neq g$.

Let $C$ and $H$ be classes of formulas; let $C_{n,s}$ be the formulas in class $C$ with
domain size $n$ and size $s$, and $H_n$ be the formulas in class $H$ with domain size $n$.
$C$ is polynomially learnable by $H$ on the uniform distribution if and only if there
exists an algorithm $A$ with inputs $\epsilon$, $\delta$, $s$, and $n$, which for all $\epsilon, \delta < 1$, all
$s, n \geq 1$, and all target formulas $f \in C_{n,s}$, outputs a representation of a
hypothesis $h \in H_n$ that with probability at least $1 - \delta$ has $e^+(h) < \epsilon$ and
$e^-(h) < \epsilon$, and the run-time of $A$ is bounded by a polynomial in
$\frac{1}{\epsilon}$, $s$, and $n$.

2.3 Fourier Transform 

We use the definitions of the Fourier transform given in [LMN 89] and [J 94].
For every subset $A \subseteq \{1, \ldots, n\}$ and for $x \in \{0,1\}^n$, we define the function
$\chi_A : \{0,1\}^n \to \{-1,+1\}$ by $\chi_A(x) = (-1)^{\sum_{i \in A} x_i}$. The function $\chi_A(x)$ is 1 if
the parity of the bits in $x$ indexed by $A$ is even, and $-1$ if the parity is odd. As
is shown in [LMN 89], the set of functions $\chi_A(x)$ forms an orthonormal basis for
the vector space of real-valued functions on the Boolean cube $Z_2^n$. Thus, every
function $f : \{0,1\}^n \to \mathbb{R}$ can be uniquely expressed as a linear combination
of parity functions, by $f = \sum_A \hat{f}(A)\chi_A$, where $\hat{f}(A) = E[f\chi_A]$. The vector of
coefficients $\hat{f}$ is called the Fourier transform of $f$. For Boolean $f$, $\hat{f}(A)$ represents the
correlation of $f$ and $\chi_A$ with respect to the uniform distribution. For the results
of this paper, we define the Positive Fourier Coefficient (PFC), which we denote
by $\hat{f}^+(A)$, as $\hat{f}^+(A) = (-1)^{|A|}E_{D^+}[\chi_A f]$. Note that since $f = 1$ on all positive
examples, this reduces to $\hat{f}^+(A) = (-1)^{|A|}E_{D^+}[\chi_A]$. We use $(-1)^{|A|}\hat{E}_{D^+}[\chi_A]$ to
denote the estimate of $\hat{f}^+(A)$.
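
For small $n$ both coefficients can be computed exactly by enumerating the cube. The sketch below assumes the parity convention described above ($\chi_A(x) = 1$ for even parity) and represents a formula as a list of index sets; the function names are illustrative, and exact enumeration stands in for the sampled estimate.

```python
from fractions import Fraction
from itertools import product

def chi(A, x):
    """Parity character: +1 if the bits of x indexed by A have even parity, else -1."""
    return -1 if sum(x[i - 1] for i in A) % 2 else 1

def holds(terms, x):
    """Evaluate a monotone DNF given as a list of index sets."""
    return any(all(x[i - 1] for i in t) for t in terms)

def fourier_coefficient(terms, A, n):
    """E[f * chi_A] under the uniform distribution, with f in {-1, +1}."""
    cube = list(product((0, 1), repeat=n))
    total = sum((1 if holds(terms, x) else -1) * chi(A, x) for x in cube)
    return Fraction(total, len(cube))

def positive_fourier_coefficient(terms, A, n):
    """(-1)^{|A|} * E_{D+}[chi_A], with D+ uniform over the positive examples."""
    positives = [x for x in product((0, 1), repeat=n) if holds(terms, x)]
    total = sum((-1) ** len(A) * chi(A, x) for x in positives)
    return Fraction(total, len(positives))

# f = x1 x2 + x2 x3 on n = 3 attributes
F = [frozenset({1, 2}), frozenset({2, 3})]
print(fourier_coefficient(F, frozenset(), 3))
print(positive_fourier_coefficient(F, frozenset({1}), 3))
```

Note that $(-1)^{|A|}\chi_A(x) = \prod_{i \in A}(2x_i - 1)$, so the PFC is the average over positive examples of a product that is $+1$ exactly when an even number of the attributes in $A$ are 0.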

3 Approximation Results for Monotone DNF

Before giving our learnability results, we need some preliminary results on the 
approximation of MDNF formulas. The main result of this section, the Diffrac- 
tion Lemma, applies to Read-once Factorable MDNF formulas. 

The first fact we give states that every MDNF formula can be approximated 
with error bounded by | by an MDNF formula with terms of size log ^ . In the 
statement of the results of this section, we will use superscript / to denote a 
term from formula /, and similarly for g. 

Fact 1 Let f = t( + . . . + be an MDNF formula. There exists an MDNF 
formula g = tf + . . . + t^ for which \tf\ < Ig C t{ , and e“(tf) < 

The proof of this fact is given in [Ver 90a]. By Fact 1, for every MDNF formula
$f$, there exists a formula $g$ with log-sized terms that approximates $f$ well. In this
section, we give a lemma that shows how to approximate the formula $g$.

Recall that the greatest common factor of a set $T$ of terms is defined to be
the largest term $t$ that is contained in every term $t_i^g \in T$. We refer to the set
of examples satisfying t as the subspace defined by t. Note that the subspace 
defined by t may contain both positive and negative examples. 

The technique we will develop in the proofs of the following lemmas is to
project a set of positive examples, or a formula, onto a larger subspace. For a
term $v$, let $PS(v)$ be the set of vectors that are zero on all $x_i \notin v$, and that
range over all possible combinations of assignments to the attributes in $v$. We
call the set $PS(v)$ the projection set for $v$. We can then define the projection
of a set of examples. For a set $X$ of examples and a term $v$, let the projection
function $P_v : 2^{\{0,1\}^n} \to 2^{\{0,1\}^n}$ be defined as
$P_v(X) = \{x \oplus y \mid x \in X, y \in PS(v)\}$,
where $\oplus$ denotes bitwise exclusive or. Thus, the projection function $P_v(X)$ takes
each example in $X$ and maps it onto each of the possible combinations of
assignments to the attributes in $v$, and leaves all other attributes unchanged.
We will refer to such a projection of examples as the projection of $X$ over $v$.

We will also use the projection function $P_v(g)$ with a formula $g$ as an
argument to mean the projection of the satisfying set of $g$ onto all possible
combinations of assignments to the attributes of $v$. Note that this is equivalent to deleting
all attributes from $g$ that occur in $v$. For example, $P_{x_1}(x_1x_2 + x_1x_3) = x_2 + x_3$.
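
The projection of a set of examples, and its agreement with attribute deletion on the example just given, can be checked by enumeration. This is a minimal sketch with illustrative names; vectors are tuples over $\{0,1\}$ and terms are sets of 1-based indices.

```python
from itertools import product

def projection_set(v, n):
    """PS(v): vectors that are 0 outside v and take every assignment on v."""
    idx = sorted(v)
    vectors = set()
    for bits in product((0, 1), repeat=len(idx)):
        y = [0] * n
        for i, b in zip(idx, bits):
            y[i - 1] = b
        vectors.add(tuple(y))
    return vectors

def project(X, v, n):
    """P_v(X) = { x XOR y : x in X, y in PS(v) }."""
    ps = projection_set(v, n)
    return {tuple(a ^ b for a, b in zip(x, y)) for x in X for y in ps}

def satisfying_set(terms, n):
    """Satisfying set of a monotone DNF given as a list of index sets."""
    return {x for x in product((0, 1), repeat=n)
            if any(all(x[i - 1] for i in t) for t in terms)}

# P_{x1}(x1 x2 + x1 x3) should have the satisfying set of x2 + x3.
lhs = project(satisfying_set([{1, 2}, {1, 3}], 3), {1}, 3)
rhs = satisfying_set([{2}, {3}], 3)
print(lhs == rhs)
```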

Let $g_i$ be the Boolean function consisting of all terms of $g$ except $t_i^g$. Thus,
$g_i = t_1^g + \cdots + t_{i-1}^g + t_{i+1}^g + \cdots + t_s^g$. As an example, consider the formula

$g = x_1x_2x_3 + x_1x_2x_4 + x_1x_5x_6$ (1)

We then have: $g_1 = x_1x_2x_4 + x_1x_5x_6$, $g_2 = x_1x_2x_3 + x_1x_5x_6$, and
$g_3 = x_1x_2x_3 + x_1x_2x_4$.

In the following proofs, we will consider various properties of the product
$\prod_{1 \le i \le s} g_i$. First, note that by the above definition,

$\prod_{1 \le i \le s} \sum_{j \neq i} t_j^g = (t_2^g + t_3^g + \cdots + t_s^g) \cdot (t_1^g + t_3^g + \cdots + t_s^g) \cdots (t_1^g + t_2^g + \cdots + t_{s-1}^g)$ (2)
We prove the following properties showing the relationship between the functions
$g_i$ and projections over factors of $g$.

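Claim 2 For an MDNF formula $g = t_1^g + \ldots + t_s^g$, let $g_i = \sum_{j \neq i} t_j^g$. Then
$\prod_{1 \le i \le s} g_i = \sum_{1 \le i < j \le s} t_i^g \cdot t_j^g$.

Proof. First, we show that each term in $\sum_{1 \le i < j \le s} t_i^g \cdot t_j^g$ is generated by
$\prod_{1 \le i \le s} g_i$. Suppose that term $t_i^g$ is satisfied. All formulas $g_j$, $j \neq i$, contain
$t_i^g$; hence $\prod_{1 \le i \le s} g_i$ is satisfied if $g_i$ is. Arguing in this manner for each $i$, we
get the formula $t_1^g \cdot g_1 + t_2^g \cdot g_2 + \cdots + t_s^g \cdot g_s$. Expanding each $g_i$ gives
$t_1^g \cdot (t_2^g + t_3^g + \ldots + t_s^g) + t_2^g \cdot (t_1^g + t_3^g + \ldots + t_s^g) + \ldots + t_s^g \cdot (t_1^g + t_2^g + \ldots + t_{s-1}^g)$,
which is $\sum_{1 \le i < j \le s} t_i^g \cdot t_j^g$. We have thus shown that every term in
$\sum_{1 \le i < j \le s} t_i^g \cdot t_j^g$ is generated by $\prod_{1 \le i \le s} g_i$, and it follows that

$\prod_{1 \le i \le s} g_i = t_1^g \cdot g_1 + t_2^g \cdot g_2 + \cdots + t_s^g \cdot g_s = \sum_{1 \le i < j \le s} t_i^g \cdot t_j^g$. (3)

□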

As an illustration of Claim 2 we continue with the example from (1). We
then get

$\prod_{1 \le i \le s} g_i = g_1 \cdot g_2 \cdot g_3$

$= (x_1x_2x_4 + x_1x_5x_6) \cdot (x_1x_2x_3 + x_1x_5x_6) \cdot (x_1x_2x_3 + x_1x_2x_4)$

$= x_1x_2x_4 \cdot x_1x_2x_3 + x_1x_5x_6 \cdot x_1x_2x_3 + x_1x_5x_6 \cdot x_1x_2x_4$. (4)
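
Claim 2 can be confirmed on this example by comparing truth tables. The sketch below is illustrative only: terms are index sets, and the two sides of the claim are evaluated as Boolean functions over all 64 vectors. Both sides are true exactly when at least two terms of $g$ are satisfied.

```python
from itertools import combinations, product

# g = x1 x2 x3 + x1 x2 x4 + x1 x5 x6, as in (1)
G = [frozenset({1, 2, 3}), frozenset({1, 2, 4}), frozenset({1, 5, 6})]

def satisfies(x, term):
    """x => t: every attribute of the term is set to 1 in x."""
    return all(x[i - 1] for i in term)

def product_of_gi(x, terms):
    """prod_i g_i: for every i, some term other than t_i is satisfied."""
    return all(any(satisfies(x, t) for j, t in enumerate(terms) if j != i)
               for i in range(len(terms)))

def sum_of_pairs(x, terms):
    """sum_{i<j} t_i * t_j: some pair of distinct terms is satisfied."""
    return any(satisfies(x, a) and satisfies(x, b)
               for a, b in combinations(terms, 2))

claim_holds = all(product_of_gi(x, G) == sum_of_pairs(x, G)
                  for x in product((0, 1), repeat=6))
print(claim_holds)
```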

Claim 3 For an MDNF formula $g = t_1^g + \ldots + t_s^g$, let $g_i = \sum_{j \neq i} t_j^g$. Then for
any term $t$, $\prod_{1 \le i \le s} P_t(g_i) = P_t(\prod_{1 \le i \le s} g_i)$.

Claim 3 follows from Claim 2 and the definition of the projection function.
The proof is not given in this extended abstract.

We continue our example above to illustrate Claim 3. Consider the example
from (1), and let $t = x_1$. In (4), we have $\prod_{1 \le i \le s} g_i = x_1x_2x_4 \cdot x_1x_2x_3 +
x_1x_5x_6 \cdot x_1x_2x_3 + x_1x_5x_6 \cdot x_1x_2x_4 = x_1x_2x_3x_4 + x_1x_2x_3x_5x_6 + x_1x_2x_4x_5x_6$. Now, taking
the projection over $t$, we get $P_t(\prod_{1 \le i \le s} g_i) = x_2x_3x_4 + x_2x_3x_5x_6 + x_2x_4x_5x_6$.
Taking the projection over each $g_i$ before taking their product, we get
$P_t(g_1) = x_2x_4 + x_5x_6$, $P_t(g_2) = x_2x_3 + x_5x_6$, and $P_t(g_3) = x_2x_3 + x_2x_4$, and taking
their product gives $\prod_{1 \le i \le s} P_t(g_i) = (x_2x_4 + x_5x_6) \cdot (x_2x_3 + x_5x_6) \cdot (x_2x_3 + x_2x_4) =
x_2x_3x_4 + x_2x_3x_5x_6 + x_2x_4x_5x_6$. Thus, we have $\prod_{1 \le i \le s} P_t(g_i) = P_t(\prod_{1 \le i \le s} g_i)$
for our example, as in Claim 3.

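Lemma 4 (Projection Lemma). For a Read-once Factorable MDNF formula
$g = t_1^g + \ldots + t_s^g$ with maximal factor set $F$, let $g_i = \sum_{j \neq i} t_j^g$. Then for any
maximal factor $t \in F$,
$\prod_{\{t_i^g \mid t \subseteq t_i^g\}} P_{t_i^g}(g_i) = \prod_{\{t_i^g \mid t \subseteq t_i^g\}} P_t(g_i)$.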
The proof of this lemma is given in the Appendix.

We finish with our example from (1) by demonstrating Lemma 4. Note that
the formula of (1) is a Read-once Factorable MDNF, with factorization
$g = x_1(x_2(x_3 + x_4) + x_5x_6)$. Thus, $x_1$ is a factor of $g$. Taking the projection of each
$g_i$ over $t_i^g$, we get $P_{t_1^g}(g_1) = x_4 + x_5x_6$, $P_{t_2^g}(g_2) = x_3 + x_5x_6$ and
$P_{t_3^g}(g_3) = x_2x_3 + x_2x_4$. Now, taking the product of these projections over all $i$ gives us
$\prod_{1 \le i \le s} P_{t_i^g}(g_i) = (x_4 + x_5x_6) \cdot (x_3 + x_5x_6) \cdot (x_2x_3 + x_2x_4) =
x_2x_3x_4 + x_2x_3x_5x_6 + x_2x_4x_5x_6$. We thus have that
$\prod_{1 \le i \le s} P_{t_i^g}(g_i) = \prod_{1 \le i \le s} P_t(g_i)$ for $t = x_1$.

We now state the main result of this section, which will be instrumental in 
proving the learnability of Read-once Factorable MDNF. 

Lemma 5 (Diffraction Lemma). For a Read-once Factorable MDNF formula 
g = -|- . . . -I- tf with terms of size at most Ig ^ and with maximal factor 

set F, and for the uniform distribution D on examples of g, let F' = {t £ 
A| every t C tf has Pr ulDS (tf)] < be the set of maximal factors such 

than every term tf containing a factor in F' has Pr£)[T> 5 (t®)] < Then 

{t®|tct®} ^?))] ^ 2 ■ 

Lemma 5 states that the terms in formula g that all have small disjoint
probability weight can be approximated by a factor from the factor set F, with 
error bounded by | . We call this lemma the diffraction lemma because it shows 
that if the “power” of the formula is concentrated in the subspaces of the common 
factors, and then it “diffracts” over orthogonal sub-spaces, then these subspaces 
cannot capture much of the negative example space. 

Proof. Recall that $DS(t_i^g)$ is the set of vectors that satisfy term $t_i^g$ but no other
term. The idea of the proof is to take each vector that satisfies exactly one
term (i.e., is in $DS(t_i^g)$) and to project it onto a polynomial number of negative
examples. We show that if we restrict the space to vectors satisfying one or
more of the common factors of formula $g$, then the projection covers all negative
examples in the restricted space. More specifically, we show that all negative
examples in the subspace restricted to satisfying one or more of the common
factors of $g$ are in $\bigcup_{1 \le i \le s} P_{t_i^g}(DS(t_i^g))$.

The set of vectors $DS(t_i^g)$ (i.e., the set satisfying $t_i^g$, but not satisfying any
other term) is the satisfying set of the formula $t_i^g \cdot \overline{g_i}$. The projection
$P_{t_i^g}(DS(t_i^g))$ is the mapping of vectors in $DS(t_i^g)$ onto all possible combinations
of values over the attributes in $t_i^g$. Let $g'_i$ be the formula obtained from $g_i$ by
deleting all attributes that occur in $t_i^g$. It is easy to verify that the projection
$P_{t_i^g}(DS(t_i^g))$ is the set of examples satisfying formula $\overline{g'_i}$. Thus,
$\bigcup_{1 \le i \le s} P_{t_i^g}(DS(t_i^g))$ is the set of examples satisfying formula
$\overline{g'_1} + \overline{g'_2} + \cdots + \overline{g'_s}$. Now, any negative example
of $g$ does not satisfy $g$, by definition. We can thus construct a formula that is
satisfied by the set of negative examples not in the set
$\bigcup_{1 \le i \le s} P_{t_i^g}(DS(t_i^g))$ as follows:

$\overline{\overline{g'_1} + \overline{g'_2} + \cdots + \overline{g'_s} + g} = g'_1 \cdot g'_2 \cdots g'_s \cdot \overline{g} = g'_1 \cdot g'_2 \cdots g'_s \cdot \overline{t_1^g} \cdots \overline{t_s^g}$ (5)
Since $g'_i = P_{t_i^g}(g_i)$, by Lemma 4, (5) gives $\prod_{1 \le i \le s} g'_i = \prod_{1 \le i \le s} P_{t_i^g}(g_i) =
\prod_{t \in F} \prod_{\{t_i^g \mid t \subseteq t_i^g\}} P_{t_i^g}(g_i) = \prod_{t \in F} \prod_{\{t_i^g \mid t \subseteq t_i^g\}} P_t(g_i)$. Restricting this product to
the space (Sjgf ' IltGF ri{ts|tct®} Pt(gi)- But this means that 

at least one of the Pt{gi) must be restricted to t, giving ri{t®|ict®} 5*- From 

©, this would require that ri{t®|tct®} 9i ' rii<i<s^i satisfiable, which is not 
possible. Since this formula represents all negative examples in the subspace 
defined by J2teF^ Ui<i<sP(9 (2?5(tf)), this implies that 

there is no negative example on this subspace that is not covered by the set 
Ui<,<,Pt9(P5(tf)). 

We have now established that every negative example that satisfies X^tGF ^ 
must belong to the set Ui<i<sPt? (P5(tf )). Now, we take the subset F' of F such 

that F' = {t G F\PTD['DS{tf)] < ^ for alH C tf}. The probability weight 
of each set Pf9{T>S{tf)) is bounded by ^ since by assumption 

PrD[VS{tf)] < and the size of the projection set PS(tf) is at most 2'°®^. 
Thus, the probability weight of the set Ui<i<sPt9 (P5(tf )) is at most Since 
all negative examples satisfying ^ contained in Ui<i<sPtn{'DS{tf)), we 

have that ^)^(EtGF'(E {ifliCt®} ^f))] — 2 ■ 

□ 

In Sect. 5 we apply Lemma 5 in the analysis of Read-once Factorable MDNF
formulas to develop an algorithm for learning this class. First, however, we give
a learnability result for the class of poly-disjoint One-read-once MDNF. The
algorithm we develop for this class will be used also in Sect. 5.



4 Learning Poly-disjoint One-read-once MDNF 

For this result, we show that for One-read-once MDNF formulas, the Positive
Fourier Coefficient, $\hat{f}^+(t)$ (defined in Sect. 2.3), of a term $t$ is related to the
probability of the disjoint set of examples, $DS(t)$; thus by finding all terms whose
$\hat{f}^+(t)$ exceeds a threshold, we find the terms for which $\Pr_D[DS(t)]$ exceeds
a corresponding bound. By doing so, we obtain a
learning algorithm for poly-disjoint One-read-once MDNF, for polynomial p = 
_ef_ 

4s^ ■ 

The algorithm we give in this section for learning poly-disjoint One-read-once
MDNF begins by drawing examples of the formula from the uniform distribution,
which we will refer to as $D_0$. After having learned a term of the target formula
on $D_0$, the algorithm will then filter examples on distribution $D_0$ to produce a
new distribution, $D_1$. In general, the algorithm will learn the $i$th term of the
target formula on distribution $D_{i-1}$.

Let $D_0$ be the initial (uniform) distribution on the examples. After each
term of the formula is found, it is added to the hypothesis $h$. Let $h_i$ denote
the hypothesis $h$ after the $i$th term has been added. Let $D_i$ be the uniform
distribution on examples not satisfying any term of $h_i$.
The algorithm for learning a term of a poly-disjoint One-read-once MDNF
formula on distribution $D_{i-1}$ is given in Fig. 1. The idea of the algorithm is as
follows. We begin with an empty set $m$. For each attribute $x_i$ such that index
$i$ is not already in $m$, we estimate the Positive Fourier Coefficient on $\{i\} \cup m$.¹
We add the attribute with the minimal Positive Fourier Coefficient whose
magnitude is at least a fixed threshold. We continue choosing attributes according
to this statistic until either the error of the term is small enough, or no attribute
$x_i$ has a Positive Fourier Coefficient statistic larger than this threshold. We use
Algorithm 1-Read-1($D_{i-1}$, $\epsilon$, $\delta$, $s$) as a subroutine to find each term of the target
formula in Algorithm Learn-1-Read-1($\epsilon$, $\delta$, $s$), given in Fig. 1.



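Algorithm 1-Read-1($D_{i-1}$, $\epsilon$, $\delta$, $s$)

1. Set $m = \emptyset$
2. While $e^-(m)$ exceeds the error bound
3.     Choose the $\ell \notin m$ with the minimal estimated PFC on $\{\ell\} \cup m$
       that is at least the threshold in magnitude
4.     Set $m = \{\ell\} \cup m$
5. return $t(m)$



Algorithm Learn-1-Read-1($\epsilon$, $\delta$, $s$)

1. If $e^-(1)$ is below the error bound then
2.     Set $h = 1$
3. Else
4.     Set $h = 0$
5.     While $e^+(h)$ exceeds the error bound
6.         $h = h$ + 1-Read-1($D_{i-1}$, $\epsilon$, $\delta$, $s$)
7. return $h$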



Fig. 1. Algorithm for Learning a Poly-disjoint One-read-once MDNF Formula 
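
The control flow of Fig. 1 can be sketched in runnable form. This is an illustration under several assumptions that are not in the paper: exact expectations computed by enumerating the cube replace the sampled estimates, the threshold `tau` is a hypothetical small constant, a term is accepted only when its negative error is exactly zero, and ties are broken by attribute index. The constant-true case of Learn-1-Read-1 is not handled.

```python
from fractions import Fraction
from itertools import product

def satisfies(x, term):
    return all(x[i - 1] for i in term)

def holds(terms, x):
    return any(satisfies(x, t) for t in terms)

def pfc(sample, A):
    """(-1)^{|A|} E[chi_A] over the sample, i.e. the mean of prod_{i in A} (2 x_i - 1)."""
    total = 0
    for x in sample:
        value = 1
        for i in A:
            value *= 2 * x[i - 1] - 1
        total += value
    return Fraction(total, len(sample))

def learn_term(positives, negatives, n, tau):
    """Grow an index set m greedily, as in Algorithm 1-Read-1."""
    m = set()
    while any(satisfies(x, m) for x in negatives):
        scores = {i: pfc(positives, m | {i})
                  for i in range(1, n + 1) if i not in m}
        eligible = {i: s for i, s in scores.items() if s >= tau}
        if not eligible:
            break
        # minimal PFC among attributes that clear the threshold
        m.add(min(eligible, key=lambda i: (eligible[i], i)))
    return frozenset(m)

def learn_formula(target, n, tau=Fraction(1, 100)):
    """Learn terms one at a time, filtering out covered positive examples."""
    cube = list(product((0, 1), repeat=n))
    negatives = [x for x in cube if not holds(target, x)]
    hypothesis = []
    while True:
        # positives that no term of the current hypothesis covers
        positives = [x for x in cube
                     if holds(target, x) and not holds(hypothesis, x)]
        if not positives:
            return hypothesis
        term = learn_term(positives, negatives, n, tau)
        if not term or term in hypothesis:
            return hypothesis  # no progress possible; stop
        hypothesis.append(term)

print(learn_formula([frozenset({1, 2}), frozenset({2, 3})], 3))
```

On $x_1x_2 + x_2x_3$ the shared attribute $x_2$ has the largest single-attribute PFC, so choosing the minimal eligible coefficient is what steers the first pick to a read-once attribute.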



In the following lemma, we show that every term of a One-read-once MDNF
formula has a positive PFC statistic. We show this result for the uniform
distribution, $D_0$, and will generalize it later to all distributions $D_i$. Let
$\{x_{j_1}, \ldots, x_{j_s}\}$ be the read-once attributes in terms $t_1, \ldots, t_s$ respectively.

Lemma 6. If $g = t_1 + \ldots + t_s$ is a One-read-once MDNF formula with read-once
attributes $\{x_{j_1}, \ldots, x_{j_s}\}$ in terms $t_1, \ldots, t_s$ respectively, then for every $1 \le i \le s$
and every $m \subseteq m(t_i)$ such that $j_i \in m$, $E_{D^+}[(-1)^{|m|}\chi_m] = \Pr_{D^+}[DS(t_i)]$.

¹ See the proof of Theorem 9 for the sample complexity required to estimate this
statistic.
Proof. Note that $E_{D^+}[(-1)^{|m|}\chi_m] = \sum_{x \Rightarrow g} (-1)^{|m|}\chi_m(x) D^+(x)$, where $D^+(x)$
is the probability that distribution $D^+$ assigns to $x$. Let $g_i$ be the formula
consisting of all terms in $g$ except $t_i$. We show that for any $i$, and for any set $m$
containing $j_i$, $\sum_{x \Rightarrow g_i} (-1)^{|m|}\chi_m(x) = 0$. This holds since attribute $x_{j_i}$ does not occur
in $g_i$, hence any vector satisfying $g_i$ with $x_{j_i} = 0$ also satisfies $g_i$ if $x_{j_i} = 1$,
and the two vectors contribute terms of opposite sign.
Thus, $\sum_{x \Rightarrow g_i} (-1)^{|m|}\chi_m(x) D^+(x) = 0$. Note that $DS(t_i)$ is the set of positive
examples $x$ such that $x \not\Rightarrow g_i$. Since by definition, $E_{D^+}[(-1)^{|m|}\chi_m] =
\sum_{x \Rightarrow g} (-1)^{|m|}\chi_m(x) D^+(x)$, the Positive Fourier Coefficient of $m$ is given by
$E_{D^+}[(-1)^{|m|}\chi_m] = \sum_{x \in DS(t_i)} (-1)^{|m|}\chi_m(x) D^+(x)$. Now, $(-1)^{|m|}\chi_m(x) = 1$
for every vector in $DS(t_i)$, and it follows that $E_{D^+}[(-1)^{|m|}\chi_m]$, the Positive
Fourier Coefficient of $m$, is equal to $\Pr_{D^+}[DS(t_i)]$. □
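
The identity of Lemma 6 can be checked exhaustively on a small formula. The sketch below is illustrative: it enumerates every subset $m$ of each term that contains a read-once attribute of that term, and compares the exact PFC with the exact probability weight of $DS(t_i)$ under $D^+$. The example formula and all names are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import combinations, product

def satisfies(x, term):
    return all(x[i - 1] for i in term)

def pfc(positives, m):
    """(-1)^{|m|} E_{D+}[chi_m], computed as the mean of prod_{i in m} (2 x_i - 1)."""
    total = 0
    for x in positives:
        value = 1
        for i in m:
            value *= 2 * x[i - 1] - 1
        total += value
    return Fraction(total, len(positives))

def check_lemma6(terms, n):
    """True if PFC(m) == Pr_{D+}[DS(t_i)] for every admissible m."""
    cube = list(product((0, 1), repeat=n))
    positives = [x for x in cube if any(satisfies(x, t) for t in terms)]
    for i, term in enumerate(terms):
        others = terms[:i] + terms[i + 1:]
        read_once = {a for a in term if not any(a in t for t in others)}
        ds = [x for x in positives
              if satisfies(x, term) and not any(satisfies(x, t) for t in others)]
        weight = Fraction(len(ds), len(positives))
        for size in range(1, len(term) + 1):
            for m in combinations(sorted(term), size):
                if read_once & set(m) and pfc(positives, m) != weight:
                    return False
    return True

# g = x1 x2 + x2 x3 + x2 x4 x5: every term has a read-once attribute
G = [frozenset({1, 2}), frozenset({2, 3}), frozenset({2, 4, 5})]
print(check_lemma6(G, 5))
```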

By the above lemma, we have shown that for all subterms of a term of the
target formula that contain the term's read-once attribute, the PFC statistic is
equal to the probability weight of the disjoint satisfying set of the term. We now
show that for all cross-terms, this statistic is not positive.

Lemma 7. If $g = t_1 + \ldots + t_s$ is a One-read-once MDNF formula with read-once
attributes $\{x_{j_1}, \ldots, x_{j_s}\}$ in terms $t_1, \ldots, t_s$ respectively, then for every $1 \le i \le s$,
every $m \subseteq m(t_i)$ such that $j_i \in m$, and any $\ell \notin m(t_i)$,
$E_{D^+}[(-1)^{|m \cup \{\ell\}|}\chi_{m \cup \{\ell\}}] \le 0$.

Proof. Let $D_{DS(t_i)}$ be the uniform distribution on the set of vectors in $DS(t_i)$.
Since the vectors in $DS(t_i)$ do not satisfy any term $t_j$ for $j \neq i$,
$\Pr_{D_{DS(t_i)}}[x_\ell = 0] \ge \frac{1}{2}$. It follows that
$\sum_{x \in DS(t_i)} (-1)^{|m \cup \{\ell\}|}\chi_{m \cup \{\ell\}}(x) D_0(x) \le 0$. □

We have proved Lemmas 6 and 7 above for the uniform distribution. In our
algorithm for learning One-read-once MDNF, we will learn only the first term on
the uniform distribution, and then skew the distribution by filtering in order to
find subsequent terms. Recall that distribution $D_i$ is formed by filtering out
examples that satisfy the hypothesis $h_{i-1}$. In the following lemma, we show that the
bounds shown in the above lemmas apply to all distributions $D_i$. In fact, they
improve on each $D_i$ by magnifying the Positive Fourier Coefficient statistic. We
assume here that term $i$ is learned on the $i$th call of Algorithm 1-Read-1.

Lemma 8. Let be the read-once attributes in terms t\,...,ts res- 

pectively. For j > i, and any m containing some attribute in {x^^, . . . ,Xi^}, we 
have £;^+_J(-l)l™lxm] = cF;^+ [(- l)l'"lxm] for some c > 1. 

Proof. Distribution Di is formed by filtering out all examples that satisfy the 
hypothesis hi-\. Since no term in hi-i contains any attribute in {xi^, . . . ,Xi^}, 
'^x^h - Xm{x) = 0 for any m containing an attribute in {xi^, . . . ,Xi^}; thus 
= 0 and it follows that T,x^g,xi^h,_S~^^'"''xm{x) = 
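$\sum_{x \Rightarrow g} (-1)^{|m|}\chi_m(x)$. Now, the Positive Fourier Coefficient is
$E_{D_0^+}[(-1)^{|m|}\chi_m] = \frac{1}{|S(g)|}\sum_{x \Rightarrow g} (-1)^{|m|}\chi_m(x)$, where $|S(g)|$ is the number
of examples satisfying $g$. The expectation on distribution $D_i$ is then given by
$E_{D_i^+}[(-1)^{|m|}\chi_m] = \frac{1}{|S(g)| - |S(h_{i-1})|}\sum_{x \Rightarrow g} (-1)^{|m|}\chi_m(x)$, where $S(h_{i-1})$ is
the set of examples satisfying $h_{i-1}$. We then have the result of the lemma, that
$E_{D_i^+}[(-1)^{|m|}\chi_m] = c\,E_{D_0^+}[(-1)^{|m|}\chi_m]$
for $c = \frac{|S(g)|}{|S(g)| - |S(h_{i-1})|}$. □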
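
The magnification factor can be observed on a two-term example. This is a minimal sketch with illustrative names, assuming $g = x_1x_2 + x_3x_4$ and a hypothesis $h_1 = x_1x_2$ that has already been learned; it compares the PFC of index sets inside the remaining term before and after filtering.

```python
from fractions import Fraction
from itertools import product

def holds(terms, x):
    return any(all(x[i - 1] for i in t) for t in terms)

def pfc(sample, m):
    """Mean over the sample of prod_{i in m} (2 x_i - 1)."""
    total = 0
    for x in sample:
        value = 1
        for i in m:
            value *= 2 * x[i - 1] - 1
        total += value
    return Fraction(total, len(sample))

G = [frozenset({1, 2}), frozenset({3, 4})]   # target g
H1 = [frozenset({1, 2})]                     # hypothesis after the first term

cube = list(product((0, 1), repeat=4))
S_g = [x for x in cube if holds(G, x)]            # positive examples of g
S_h = [x for x in cube if holds(H1, x)]           # examples satisfying h_1
filtered = [x for x in S_g if not holds(H1, x)]   # support of the filtered D+

c = Fraction(len(S_g), len(S_g) - len(S_h))
magnified = all(pfc(filtered, m) == c * pfc(S_g, m)
                for m in ({3}, {4}, {3, 4}))
print(c, magnified)
```

Here $|S(g)| = 7$ and $|S(h_1)| = 4$, so each coefficient is scaled by $c = 7/3$.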

We can now state the following theorem. 

Theorem 9. The class of poly-disjoint One-read-once MDNF formulas is
learnable on the uniform distribution.



Proof By Fact d any MDNF formula / can be approximated by a formula g 
with terms of size at most Ig^, with error bounded by Thus, we assume 
hereafter without loss of generality that the target formula is such a g, and show 
in the remainder of the proof that g can be approximated with error | . 

Consider the index £ chosen at the first execution of step 3 of Algorithm 1- 
Read-1 (see Fig.Q])- i is chosen to be the index of the attribute with the minimal 
estimated PFC, Ej^+ [(— that is at least ^ in magnitude. 

We first argue that the first attribute chosen will, with high probability, be 

a read-once attribute. By definition of poly-disjoint, each term in the formula 

g has Pr[X>iS(t?)] > In particular, the read-once attribute Xj^ must have 

Ejj+ [(— Any attribute xi that is not a read-once attribute 

must either occur in no term of the formula, or else occur in two or more terms. If 

Xi is not in any term of the formula, then Ej^+ [(— = 0. Furthermore, 

if £ occurs in two or more terms, then since all terms are poly disjoint, its 

2 

probability must be at least ^ greater than that of the read-once attributes of 

4 

those terms. Using Chernoff bounds [C 52, AV 79], a sample of size 0(^(log ™-|- 
loglog ^)) is sufficient to ensure with probability at least 1— ^ that X£ occurs 

in exactly one term. 

By Lemma El once we have a read-once attribute from term tf, on each 
subsequent execution of step 3, Ej^+ [(— ^ if attribute 
£ is in the same term as m. If £ is not in the same term as m, by Lemma 
0 Ejj+ [(— < 0- By the Chernoff bound argument above, a 

4 

sample of size 0(^(log ^ + log log f )) is sufficient to ensure with probability at 
least 1 that £ is in the same term as m. 

sn log j ’ 

4 

Estimating the PFC requires 0(^(log ^-|-log log j)) examples for each attri- 
bute X(. Since there are s terms in the target formula each with log j attributes, 
and we measure this statistic on all attributes each time we choose an attribute, 
the overall time and sample complexities are 0{n^ log f (log ^-|-loglog j)). The 
probability that any attribute is chosen incorrectly is bounded by $\delta$. □

5 Learning Read-once Factorable MDNF

We now apply the Diffraction Lemma of Sect. 0 to show that the learning
algorithm of the previous section is also a learning algorithm for the class of
Read-once Factorable MDNF.

Theorem 10. The class of Read-once Factorable MDNF formulas is learnable
on the uniform distribution.

Proof. As in the proof of the previous section, by Fact ^ the MDNF formula 
/ can be approximated by a formula g with terms of size at most Ig with 
error bounded by so we assume hereafter without loss of generality that the 
target formula is such a g, and show in the remainder of the proof that g can be 
approximated with error |. 

We use Algorithm 1-Read-l of Fig. ^to learn a term of the Read-once Facto- 
rable MDNF formula g. Consider the index t chosen at step 3 of the algorithm. 
Either xi is a read-once attribute, or else it is an attribute from the factor of 
two or more terms that all have Pi i:i+[DS{ti)] < 

If xi is a read-once attribute, then by the same argument as in the proof 
of the previous theorem. Algorithm 1-Read-l will return term If xi is not a 
read-once attribute in term ti, then it is an attribute from the factor of two or 
more terms that all have Pijj+[DS{ti)] < By Lemma 0 all terms ti with 
factor t can be approximated by t. Since the factored form is read-once, attribute 
Xi does not occur in any other term; thus, xi is a read-once attribute in term t. 
Applying Lemmas 0 and Q as in the proof of the previous theorem. Algorithm 
1-Read-l will return the greatest common factor t. The total error incurred by 
all such approximations by common factors is bounded by |, by Lemma 0 
The time and sample complexities are as shown in the proof of Theorem 0 
0(nf^logf(logf Aloglogf)). □ 

6 Conclusions and Open Problems 

In this paper, we have given learning algorithms for two new sub-classes of 
MDNF formulas on the uniform distribution: poly-disjoint One-read-once MDNF 
formulas; and Read-once Factorable MDNF formulas. The class of Read-once 
Factorable MDNF formulas is a generalization of Read-once MDNF. Since the 
same algorithm is used for both classes, we in fact have a learning algorithm for 
the union of these two classes. 

The worst case time complexity of the algorithm is a fifth order polynomial, 
a time complexity which would be impractical for large instances. The worst 
case scenario occurs if the probability weight of the set of vectors satisfying each 
term is very small (i.e., This will not be the case for most formulas; a more 
typical case would be when each term has disjoint probability near ^ , since for a 
formula with s terms, at least one term must have probability weight at least L 
The threshold in Algorithm 1-Read-1 could be adjusted for such cases to yield
an improved complexity. Cubic complexity would result for formulas where the
disjoint probability is for example. Extending this idea, a rigorous average 
case analysis for the algorithm presented in this paper would be an interesting 
area of future research.

A Appendix

In this appendix, we give the proof of Lemma 0 For this proof, we require the 
following additional terminology. 

Let Ri = {j\l < j < s and j ^ i} denote the set of all integers from 1 to s 
except i. We can then denote (0 by 

II 9^= E n ■ (6) 

(il,j2....,ls)etilXfl2X...Xfis l<j<s 

Proof (of Lemma We will first show that every term in Oft^itct®} Pt{9^) 
is also contained in OfUliCt®} Taking the projection of Q over t, and 

applying Claim 0 ri{t®|tct®} Pt(gi) can be expressed as: 

n Ptig^)= E mi) ■ Pt(g^) ■ 

{tfitctf} {tfltctf} 

We will show that each ■ Pt{gi) is generated by ri{ts|tcts} Ppi9i)- 

Taking the projection of m over tf, we get 

n pp^9^)= E n ■ (7) 

{t®|iCt®} (ji,j 2 , -,js)&RlXR 2 X ■■■XRs {t?|tct®} 

For any k such that factor t C consider Pt^^{gk) ■ Yl{j\tct‘> j^k} Pp-i^k)- This 
is the term of o in which jk varies over Rk, and all other ji are equal. 

We claim that n{j|tct» j^k} PpX^V) ~ PtXk)- This follows from the maxima- 
lity of t: since t is a factor of t®, and for every attribute x in t, there exists some 
term not containing x. Thus, if x € t, x does not occur in n{j|tct® j^k} Ppifl)^ 

but if X occurs in but not in for j ^ k, then x will occur in OfUliCt®} Ppifl)- 
Thus, ri{if|ictf} Pp.Xl) = PtXl)- 

Now, we have Pt^igk) ■ ri{t('|tctf} Pp^Xl) = Pt^igh) ■ PtXl). But this is equi- 
valent to Pt{gk) ■ Pt{t(f), and we have completed the proof that every term of 
Ptftf) ■ Pt{gi) is generated by

We now show that ri{t»|tct®} (ff*) does not generate any term not in
Pt{tf) ■ Pt{gi)- In the above, we considered only the terms generated by Pt^igh) ■ 
ri{j|tct® which are terms in (|7I) where jk varies over Rk, and all 

other ji are equal. But ri{t!'|tct®} (di) contains terms of the form (gk) ■ 
,j=jtk,t=jtk} (t?') where P may be any value such that t C tj,, 

except k or £. (i.e., terms in o where all ji are not equal, for i ^ k.) We 
show here that all of the attributes in n{j|tct» j^k} Pt<’X't-k) niust be contained 
in ri{j|tct® j/fc (^?')- Suppose that there exists some attribute x 

in such that x is contained in all terms except t®. If there is no such attribute, 
then ri{j|ict® j^Lk e^ik} Xk) ~ Now, since x occurs in every term except 

t®, it follows that for any P^a{tX) contains x; hence for any choice of £ and 
Y\{j\t<zto. jiLk/iLk} Pto.i£k)PtlXl') = Pt{tk)- This completes the proof that the 
terms of Pt{tf) ■ Pt{gi) are the only terms generated by ri{t®|tci®} Pt^(gi)- 

□

Abstract. This paper studies practically feasible inductive inference of
functions or other objects that can be described by means of an attribute
grammar. In our approach, attribute grammars can encode various kinds of
knowledge about the object to be found, ranging from the usual input/output
examples to assumptions about the unknown object's syntactic structure and
some of its dynamic properties. We present theoretical results and describe
the architecture of a practical inductive synthesis system based on these
theoretical findings.



1 Introduction 

The problem of discovering new proofs, formulas, algorithms, etc. is usually
solved by some kind of exhaustive search. One of the main issues here is how to
minimize the extent of the search by using our hypothetical knowledge about the
object to be discovered. In this article we concentrate on the synthesis of
syntactic objects using various kinds of knowledge about them. If the objects we
are trying to synthesize are, e.g., expressions in some fixed signature, then in
the simplest case that knowledge will be (after assigning some interpretation to
the signature) function values computed on some sample argument values, i.e.,
the usual input/output examples. However, we also want to be able to describe
other properties of the unknown expression (function), i.e., to treat the
unknown function not as a black box but as a "gray box" function. These other
properties could be either purely syntactic properties of the expression we are
looking for or, taking some interpretation into account, dynamic properties of
the function evaluation process.

The question we address in this paper is how we should present our knowledge so
that it is possible to rapidly examine those and only those objects that match
it. Roughly speaking, the aim of this article is to show that, in some sense,
such a search can be performed efficiently enough.

Our central aim is the synthesis of syntactic objects, i.e., expressions over
some fixed signature or programs in some fixed programming language, that can be
supplemented by a semantic interpretation.

It has long been understood that it is convenient to describe such syntactic
objects by means of context-free grammars. A grammar then generates a language
whose strings are our syntactic objects.

The well-known notion of attribute grammars is linked with the notion of
context-free grammars. The central observation on which our approach is based
is that various kinds of hypothetical knowledge about the unknown object can
usually be described by means of an attribute grammar that is based on the
context-free grammar defining the description space.

As an example, suppose that our description space is defined by a context-free
grammar describing the language of simple arithmetic expressions. Then, by
supplementing each nonterminal with just a single attribute, we can make an
attribute grammar that, for example, counts the number of multiplication
operations in expressions, or that can be used to calculate the value of the
expression when variable values are fixed, or that limits the values of
intermediate expressions. See [3] for more details.
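
To make the arithmetic-expression example concrete, here is a minimal sketch of
two such single-attribute computations over expression trees. It is not the
authors' system: the tuple encoding of expressions, the function names, and the
`limit` parameter are assumptions chosen for illustration. The second function
returns `None` where the attribute is undefined, mimicking a partially defined
evaluation function that rejects expressions with out-of-range intermediate
values.

```python
from typing import Optional, Union

# An expression is an int constant, a variable name, or (op, left, right).
Expr = Union[int, str, tuple]

def count_mul(e: Expr) -> int:
    """Synthesized attribute: number of multiplications in the expression."""
    if isinstance(e, tuple):
        op, left, right = e
        return count_mul(left) + count_mul(right) + (1 if op == "*" else 0)
    return 0

def bounded_value(e: Expr, env: dict, limit: int) -> Optional[int]:
    """Synthesized attribute: value of e for fixed variable values, or None
    (undefined) as soon as any intermediate value leaves the range 0..limit."""
    if isinstance(e, int):
        v = e
    elif isinstance(e, str):
        v = env[e]
    else:
        op, left, right = e
        a = bounded_value(left, env, limit)
        b = bounded_value(right, env, limit)
        if a is None or b is None:
            return None
        v = a + b if op == "+" else a * b
    return v if 0 <= v <= limit else None

expr = ("+", ("*", "x", "x"), ("*", 2, "y"))   # x*x + 2*y
env = {"x": 3, "y": 4}
print(count_mul(expr), bounded_value(expr, env, 20), bounded_value(expr, env, 10))
```

An enumeration procedure that only accepts trees whose attributes are all
defined would keep `expr` under the first limit and discard it under the
second.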

The main problem we will solve is the following: if some attribute grammar is 
given, is it possible to efficiently enumerate the corresponding language without 
considering strings that do not belong to it? The aim of this article is to show 
that, if some conditions hold, it is possible. 

The results discussed in this article generalize the results presented in 0, 
which in turn was generalization of results presented in ^ and |2j. In we 
considered only attribute grammars with synthesized attributes, and these at- 
tributes had to be independent, i.e., equations for computing one attribute could 
involve only the same attribute of the production right-hand side symbols. In 
this article we show that a more general class of attribute grammars can be 
considered, having possibly also inherited attributes and dependencies between 
different attributes. In some sense the presented article can be regarded as the 
concluding article in the series that started with PJ. 

Other approaches to synthesis of expressions include discovery systems BA- 
CON (P, |H]), genetic programming (P, jS], p]). 

2 Definitions and the Main Result 

We suppose that an attribute grammar associates constant values with terminal 
symbols. For nonterminals attribute values are evaluated by means of corre- 
sponding functions. Every production of the grammar has one function for each 
synthesized attribute of the left-hand side nonterminal and one function for each 
inherited attribute of each of the right-hand side nonterminal and terminal sym- 
bols. If there is a production with several instances of the same nonterminal 
on the right-hand side, we would write the expression defining the function for 
computing attributes like this: 

S i — AA ! dg i — — ^A2

In our earlier article we used conditions (binary predicates that could be
attached to every production), and only inference trees with every predicate
being true for every tree node were acceptable. Here predicates will be modeled
by partially defined functions, and only inference trees with all attributes
defined will be acceptable; see below for a more detailed explanation.

By the language defined by such a grammar we will understand the set of all
strings that can be inferred by means of acceptable inference trees.

There is some domain associated with every attribute. From now on we will
consider only attributes with finite domains $D$ (for the sake of simplicity we
will assume that these domains are subsets of $\mathbb{N}$). That means that the
arguments of functions for computing attributes, as well as their results,
belong to $D$. As we already mentioned, these functions can be partially defined
as well. We will say that a grammar is finite if its attributes are of finite
domain.

In the following discussion, in order to avoid talking about complexity of 
attribute evaluation functions, we will assume that each function value, given 
function arguments, can be computed in constant time. 

We will say that an algorithm enumerates a language in setup time $T$ and
$i$th step time $T_i$, if this algorithm outputs the first string $w_1$ in time
$T + T_1$, and the $i$th string $w_i$ ($i = 2, 3, \ldots$) in time $T_i$ from
the moment when outputting the previous string $w_{i-1}$ was finished. In this
paper by algorithm we mean a RAM-machine.

Now let us repeat some definitions in a more formal manner that would be 
convenient for presenting our results. 

Let $G = (T, N, P, S)$ be a context-free grammar, where:

- $T$ is a finite terminal symbol set;

- $N$ is a finite nonterminal symbol set;

- $S \in N$ is the start symbol;

- $P$ is a finite production set, where each production $k$ ($1 \le k \le |P|$)
  is of the form $A_k \to B_{k,1} B_{k,2} \cdots B_{k,s(k)}$, where $A_k \in N$
  and $B_{k,i} \in N \cup T$.

By trees we will denote structures of the form

- $(T_i)$, where $T_i \in T$, or

- $(N_i, K_1, K_2, \ldots, K_n)$, where $N_i \in N$ and the $K_j$ are trees.

We will say that a tree $K$ corresponds to a terminal symbol $T_i$, if
$K = (T_i)$. We will say that a tree $K$ corresponds to a production
$N_i \to B_{k,1} B_{k,2} \cdots B_{k,s(k)}$, if
$K = (N_i, K_{k,1}, K_{k,2}, \ldots, K_{k,s(k)})$, where the trees $K_{k,i}$
correspond to the symbols $B_{k,i}$. We will say that a tree $K$ corresponds to
a nonterminal $N_i$, if it corresponds to some production
$N_i \to B_{k,1} B_{k,2} \cdots B_{k,s(k)}$.

Now we will define the terminal string $w(K)$ corresponding to a tree $K$:

$w((T_i)) = T_i$

$w((N_i, K_1, K_2, \ldots, K_n)) = w(K_1) w(K_2) \ldots w(K_n)$

A string $w \in T^*$ can be inferred in grammar $G$, if there exists a finite
tree $K_w$ corresponding to the start symbol $S$ such that $w = w(K_w)$; such a
tree will be called an inference tree of $w$. The set of all strings that can be
inferred in $G$ will be called the language $L(G)$. The grammar $G$ is
unambiguous, if for every string $w \in L(G)$ there exists only one inference
tree $K_w$. By the depth of a string $w \in L(G)$ we will denote the depth of
the corresponding inference tree.
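
The tree and terminal-string definitions translate almost directly into code.
The following sketch assumes trees are Python tuples, that the grammar has no
empty productions (so a one-element tuple is always a terminal), and that a
single terminal leaf has depth 0; the toy grammar in the example is
hypothetical.

```python
# A tree is (symbol,) for a terminal or (symbol, child_1, ..., child_n) for a
# nonterminal, mirroring the structures (T_i) and (N_i, K_1, ..., K_n).

def terminal_string(tree):
    """w(K): concatenate the terminals at the leaves from left to right."""
    symbol, *children = tree
    if not children:
        return symbol
    return "".join(terminal_string(child) for child in children)

def depth(tree):
    """Depth of a tree; a terminal leaf is taken to have depth 0."""
    _, *children = tree
    return 0 if not children else 1 + max(depth(child) for child in children)

# Inference tree of "x*y" in a toy grammar E -> E * E | x | y.
tree = ("E", ("E", ("x",)), ("*",), ("E", ("y",)))
print(terminal_string(tree), depth(tree))
```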

Each of the elements $C_k \in N \cup T$ may have several attributes; the domains
of all attributes are finite subsets of the natural numbers. Values of
attributes assigned to elements of $T$ (terminals) are constants, while values
of attributes of elements of $N$ (nonterminals) are computed using attribute
evaluation functions. We will denote by $c_k$ the tuple of all attributes of
$C_k$, i.e., $(c_{k,0}, \ldots, c_{k,j})$.

There will be two kinds of attributes: synthesized and inherited attributes. 
We will denote synthesized attributes by and inherited attributes by c*"^. 

Every production has several attribute evaluation functions assigned to it:
if the production is of the form $C_{i_0} \to C_{i_1} C_{i_2} \ldots C_{i_l}$,
then there is a corresponding function for each synthesized attribute of
$C_{i_0}$ and for each inherited attribute of $C_{i_1} C_{i_2} \ldots C_{i_l}$.
For a synthesized attribute the evaluation function is
$f(c_{i_0}, \ldots, c_{i_l})$; for an inherited attribute of $C_{i_j}$ the
evaluation function is $f(c_{i_0}, c_{i_j})$, i.e., synthesized attributes can
depend on all other attributes in the production, while inherited attributes can
depend only on other attributes of the same symbol as well as on attributes of
the left-hand side symbol of the production.

Although the same function could be used as the attribute evaluation function
for several attributes (e.g., the identity function), we will assume that each
attribute has its own, separate evaluation function (possibly partially
defined). If we regard an evaluation function as defined by means of a table
with a separate row for each possible argument tuple together with the
corresponding function value, then by function volume we will understand the
number of rows in this table. By production volume we will understand the sum
of volumes of all attribute evaluation functions attached to this production. By
volume of the grammar we will understand the sum of all production volumes.

A context-free grammar $G$ that is supplemented with attributes and attribute
evaluation functions will be called an attribute grammar and denoted by $G^+$.

A tree $(T_i)$ has the same attributes as the terminal symbol $T_i$, and its
attribute values are the same constants. A tree
$(C_{i_0}, K_{i_1}, K_{i_2}, \ldots, K_{i_l})$ that corresponds to the
production $P_i = (C_{i_0} \to C_{i_1} C_{i_2} \ldots C_{i_l})$ has the same
synthesized attributes as the nonterminal $C_{i_0}$, and these attributes are
evaluated by first evaluating the values of the attributes
$c_{i_1}, \ldots, c_{i_l}$ that are assigned to $K_{i_1}, \ldots, K_{i_l}$ and
then by using the corresponding functions attached to production $P_i$.
Similarly, subtrees $K_j$ have the same inherited attributes as the
nonterminals $C_j$.

We will say that a string $w$ can be inferred in grammar $G^+$, if

- it can be inferred in $G$, and

- it is possible to compute the values of all attributes assigned to the
  inference tree and each of its subtrees (this is not always possible, because
  attribute evaluation functions are partially defined).

By the language $L(G^+)$ generated by an attribute grammar $G^+$ we will
understand the set of all strings that can be inferred in $G^+$.

Theorem 1. There exists an algorithm which, having received an arbitrary finite
unambiguous noncircular attribute grammar, enumerates the corresponding language
without repetition in setup time $O(|G^+|^k)$, where $|G^+|$ is the volume of
the grammar and $k$ is the maximal number of attributes belonging in the grammar
to a single symbol, and $i$th step time $O(|w_i|)$, where $w_i$ is the string
output in the $i$th step.

3 Sketch of Proof 

We will define the grammar graph $\mathcal{G}_{G^+}$ corresponding to the
grammar $G^+$. It will contain two kinds of nodes: terminal and nonterminal
symbol nodes, and production nodes.

There will be a symbol node in $\mathcal{G}_{G^+}$ corresponding to each triple
$(C, a, v)$, where $C \in N \cup T$, $a$ is some attribute belonging to $C$ and
$v$ is some value of this attribute. Production nodes will correspond to table
rows (equalities) that define the attribute evaluation functions of $G^+$: if
the function $f$ that is defined for computing attribute $c$ of production
$C_{i_0} \to C_{i_1} C_{i_2} \ldots C_{i_l}$ is defined by $n$ rows of the form
$f(c_{i_0}, c_{i_1}, \ldots, c_{i_l}) = c$, each of these rows will have a
corresponding production node $p$. For each $p$ there will be the following
arcs in $\mathcal{G}_{G^+}$ as well. There will be an arc from $p$ to symbol
nodes $(C_{i_0}, c_{i_0,j}, v)$, where $c_{i_0,j}$ is an attribute present in
the definition of the attribute evaluation function corresponding to $p$ and $v$
is some value of this attribute that is used in the specific table row for $p$.
These symbol nodes will be called upper nodes of the production node $p$. There
will also be arcs from every node $(C_{i_m}, c_{i_m,j}, v)$ ($m > 0$) to $p$,
where $c_{i_m,j}$ is some attribute present in the definition of the attribute
evaluation function corresponding to $p$ and $v$ is some value of this attribute
that is used in the specific table row for $p$; these symbol nodes will be
called lower nodes of $p$.
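
As a rough illustration of this construction, the sketch below builds symbol
nodes, production nodes, and arcs from a list of table rows. It is a simplified
reading of the definition, restricted to synthesized attributes; the `Row`
record, the function name, and the toy grammar are assumptions introduced for
the example and do not come from the paper.

```python
from collections import namedtuple

# One table row of an evaluation function: attribute `attr` of the left-hand
# side symbol `lhs` takes value `result` when the listed right-hand side
# attributes (symbol, attribute, value) have the listed values.
Row = namedtuple("Row", "production lhs attr result args")

def build_grammar_graph(rows):
    """Return (symbol_nodes, production_nodes, arcs); synthesized attributes only."""
    symbol_nodes, production_nodes, arcs = set(), [], []
    for index, row in enumerate(rows):
        p = ("p", index)                      # one production node per table row
        production_nodes.append(p)
        upper = (row.lhs, row.attr, row.result)
        symbol_nodes.add(upper)
        arcs.append((p, upper))               # arc from p to its upper node
        for symbol, attr, value in row.args:
            lower = (symbol, attr, value)
            symbol_nodes.add(lower)
            arcs.append((lower, p))           # arc from each lower node to p
    return symbol_nodes, production_nodes, arcs

# Hypothetical toy grammar: E -> E + E with one attribute "val" computed mod 2.
rows = [
    Row("E->E+E", "E", "val", (a + b) % 2, [("E", "val", a), ("E", "val", b)])
    for a in (0, 1) for b in (0, 1)
]
symbol_nodes, production_nodes, arcs = build_grammar_graph(rows)
print(len(symbol_nodes), len(production_nodes), len(arcs))
```

Each row is handled once with a constant amount of work per argument, so the
number of production nodes equals the number of table rows.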

Now we will define the compressed grammar graph $\mathcal{C}_{G^+}$ that
corresponds to the grammar graph $\mathcal{G}_{G^+}$. It will also contain
symbol nodes and production nodes. In this graph symbol nodes will contain
pointers to symbol nodes in the grammar graph $\mathcal{G}_{G^+}$, and similarly
production nodes will contain pointers to production nodes in the grammar graph
$\mathcal{G}_{G^+}$. For each symbol node in $\mathcal{C}_{G^+}$ there will be
as many pointers to symbol nodes in $\mathcal{G}_{G^+}$ as there are attributes
attached to the corresponding symbol in $G^+$, and for each production node in
$\mathcal{C}_{G^+}$ the number of pointers to production nodes in
$\mathcal{G}_{G^+}$ will equal the number of attribute evaluation functions
attached to the corresponding production in $G^+$.

The compressed grammar graph $\mathcal{C}_{G^+}$ will contain the following
nodes:

- For each terminal symbol in $G^+$ there will be a single terminal symbol node
  in $\mathcal{C}_{G^+}$ with pointers to the corresponding terminal nodes in
  $\mathcal{G}_{G^+}$ (for each attribute $c$ of $C \in T$ there is only one
  node $(C, c, v)$ in $\mathcal{G}_{G^+}$).

- If, for some production $P$ with $k$ attached attribute evaluation functions,
  there are $k$ production nodes $p_i$ in $\mathcal{G}_{G^+}$ such that for
  every attribute $c$ of the right-hand side of $P$ and for all lower nodes of
  $p_i$ corresponding to $c$ there is a symbol node $s_i$ in $\mathcal{C}_{G^+}$
  with pointers to all these nodes, then there is a production node in
  $\mathcal{C}_{G^+}$ with pointers to all nodes $p_i$, and there are arcs in
  $\mathcal{C}_{G^+}$ from all $s_i$ to this production node.

- If there is a set of symbol nodes $s_i$ in $\mathcal{G}_{G^+}$ such that there
  is a production node $p$ in $\mathcal{C}_{G^+}$ containing pointers to all
  lower nodes of $s_i$, then there is a symbol node in $\mathcal{C}_{G^+}$
  containing pointers to all $s_i$ nodes, and there is an arc from $p$ to this
  symbol node.

For nodes of $\mathcal{C}_{G^+}$ (not necessarily for each of them) we will
define weights:

- the weight of terminal symbol nodes is 0;

- if a production node $p$ has lower nodes $s_1, \ldots, s_n$, the weight of
  this node equals $\max(w_{s_1}, \ldots, w_{s_n}) + 1$, where $w_{s_i}$ are the
  weights of the lower nodes of $p$;

- if $p_1, \ldots, p_k$ are production nodes with a common upper node $s$ whose
  weights are defined and are equal to $w_{p_1}, \ldots, w_{p_k}$, then the
  weight of $s$ equals $\min(w_{p_1}, \ldots, w_{p_k})$; otherwise (i.e., if
  there is no such production node) the weight of $s$ is not defined.
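
The three weight rules above can be evaluated by repeated relaxation until
nothing changes. The sketch below is one straightforward way to do this, not
the procedure used in the paper (which computes weights while the compressed
graph is being built); the dictionary encoding and node names are assumptions
made for illustration.

```python
def compute_weights(terminals, productions):
    """terminals: iterable of terminal symbol-node names.
    productions: dict mapping production-node name -> (upper, [lower, ...]).
    Returns a dict of defined weights; nodes with undefined weight are absent."""
    weight = {t: 0 for t in terminals}
    changed = True
    while changed:
        changed = False
        for p, (upper, lowers) in productions.items():
            if any(s not in weight for s in lowers):
                continue                      # weight of p is not defined yet
            w = max((weight[s] for s in lowers), default=0) + 1
            if p not in weight or w < weight[p]:
                weight[p] = w
                changed = True
            # a symbol node takes the minimum over its production nodes
            if upper not in weight or w < weight[upper]:
                weight[upper] = w
                changed = True
    return weight

productions = {
    "p_sum": ("E", ["E", "+", "E"]),   # E -> E + E
    "p_leaf": ("E", ["x"]),            # E -> x
    "p_dead": ("F", ["F"]),            # F -> F never yields a finite tree
}
weights = compute_weights(["x", "+"], productions)
print(weights)
```

Weights only ever decrease and are bounded below by 0, so the loop terminates;
nodes such as `F` that derive no finite tree never receive a weight.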

Graph $\mathcal{C}_{G^+}$ will be augmented by dotted arcs according to the
following rule. Assume that $s$ is some symbol node and $p_1, \ldots, p_k$ are
production nodes with defined weights $w(p_i)$ such that their upper node is
$s$, and they are ordered so that $w(p_1) \le w(p_2) \le \cdots \le w(p_k)$.
Then dotted arcs go from $s$ to $p_1$, from $p_1$ to $p_2$, ..., from $p_{k-1}$
to $p_k$.

By $C$ we will denote some symbol of $G$, and by $v = (v_1, \ldots, v_k)$ some
vector of its attributes' values. It is easy to see that the weight of the
symbol node in $\mathcal{C}_{G^+}$ that corresponds to $C$ and has pointers to
nodes $(C, c_j, v_j)$ in $\mathcal{G}_{G^+}$ equals the depth of the most
shallow inference tree that corresponds to $C$ and whose attributes' value
vector is $v$.

Grammar graph $\mathcal{G}_{G^+}$ construction. This can be performed in time
$O(|G^+|)$, because in $\mathcal{G}_{G^+}$ the number of production nodes equals
$|G^+|$ and each production node can be added in constant time.

Graph compression. The compressed graph $\mathcal{C}_{G^+}$ will be constructed
in several stages. During the first stage, for each grammar symbol two
attributes will be compressed, obtaining a partial $\mathcal{C}_{G^+}$ with
every node containing no more than two pointers to nodes of
$\mathcal{G}_{G^+}$. In the consecutive stages other attributes will be added,
until the full $\mathcal{C}_{G^+}$ is obtained. For the sake of simplicity here
we will briefly consider only how the first stage is carried out. The algorithm
will consist of an initial step and an iterative step.

Initial step. Terminal symbol nodes of $\mathcal{C}_{G^+}$ are constructed.
Weight 0 is assigned to these nodes.

Iterative step. Consider all pairs of production nodes $p_1$ and $p_2$ in
$\mathcal{G}_{G^+}$ such that no corresponding node in $\mathcal{C}_{G^+}$ is
constructed yet, but for all their lower nodes in $\mathcal{G}_{G^+}$ there are
matching nodes in $\mathcal{C}_{G^+}$. Then a production node $p$ in
$\mathcal{C}_{G^+}$ with pointers to $p_1$ and $p_2$ can be constructed and its
weight can be computed. If there is a symbol node $s$ in $\mathcal{C}_{G^+}$
with pointers to the upper nodes of $p_1$ and $p_2$, a dotted arc is added in
$\mathcal{C}_{G^+}$ from the last production node on the path formed by dotted
arcs leaving $s$ to $p$. Otherwise such an $s$ is constructed, and a dotted arc
is added from $s$ to $p$.

The j-th stage of graph compression can be performed in time
hence we get the time complexity estimation of $O(|G^+|^k)$, where $k$ is the
maximum number of attributes attached to a single symbol in $G^+$.

In the terms of Theorem 1, grammar graph construction and graph compression
comprise the setup stage; therefore we have shown that the time complexity of
the setup stage is $O(|G^+|^k)$.

String output. The inference tree $K$ of a string $w \in L(G^+)$ will be called
annotated if there is a node of graph $\mathcal{C}_{G^+}$ associated with every
subtree $K_i$ of $K$ according to the rule that, if $K_i$ is of the form $(S)$,
where $S \in T$, or $(S, K_{i,1}, \ldots, K_{i,n})$, where $S \in N$, and the
value of its attribute vector is $v$, the associated graph node is $(S, v)$.

We will define the minimal subtree of a symbol node $s$ in $\mathcal{C}_{G^+}$
that corresponds to symbol $S$. If $S \in T$, then the minimal subtree is $(S)$.
If $S \in N$ and there is no dotted arc leaving $s$, the minimal subtree is not
defined for this node. If $S \in N$ and there is a dotted arc leaving $s$, then
we have to consider the production node $p$ that this arc enters. If this
production node corresponds to the production $S \to S_1 \ldots S_n$, then the
minimal subtree of $s$ is $(S, K_1, \ldots, K_n)$, where $K_i$ are the minimal
subtrees of the lower nodes of $p$. If for some node the minimal subtree is
defined, it is a finite object with depth equal to the weight of this node.

Similarly we define the minimal subtree of a production node of
$\mathcal{C}_{G^+}$. It will be defined only for production nodes with defined
weights. The minimal subtree of a production node will be equal to the minimal
subtree of its upper node. The depths of minimal subtrees of production nodes
are equal to their weights as well. The number of steps necessary for outputting
the terminal string $w(K)$ that corresponds to the minimal subtree $K$ of some
node is $O(|w(K)|)$, where $|w(K)|$ is the length of this string.

We will say that for an annotated inference tree $K$ there exists an alternative
inference tree, if there is a dotted arc leaving the production node $p$
corresponding to $K$ and entering some production node $p'$. Then the minimal
subtree of node $p'$ will be called the alternative inference tree for tree $K$.

The language $L(G^+)$ that the presented algorithm enumerates can be infinite;
therefore enumeration will be performed in a breadth-first manner. A potentially
infinite queue will be used for storing marked inference trees. By marked
inference trees we will understand annotated inference trees with (possibly)
marked subtrees for which there exist alternative inference subtrees.

Assuming that $\mathcal{C}_{G^+}$ contains only one node $s$ corresponding to
the start symbol $S$ of grammar $G^+$, the algorithm for enumerating will be as
follows.

Initial step.

- The terminal string $w(K)$ that corresponds to the minimal subtree $K$ of the
  node $s$ is output.

- $K$ is entered into the queue, marking all subtrees for which alternative
  subtrees exist.

Iterative step. While the queue is not empty:

- Take the first inference tree $K$ from the queue.

- If at least two subtrees of $K$ are marked, enter $K$ at the end of the queue,
  removing the first marker from the left-hand side.

- Replace the first marked subtree of $K$ by its alternative inference subtree,
  obtaining inference tree $K'$.

- Output $w(K')$.

- If in $K'$ there is a subtree to the right of the changed point with an
  alternative, put $K'$ at the end of the queue, marking all alternative points
  from the changed point (inclusive) to the right.

In the general case, when graph $\mathcal{C}_{G^+}$ contains $n$ nodes $s_i$
corresponding to the grammar start symbol $S$, $n$ separate queues are set up.
The words for which the value of their inference tree corresponds to $s_1$ are
output on the first, $(n+1)$-st, $(2n+1)$-st, etc. steps of the algorithm, the
words for which that value corresponds to $s_2$ are output on the second,
$(n+2)$-nd, $(2n+2)$-nd, etc. steps, and so on.
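
The interleaving of the $n$ queues amounts to a round-robin schedule over $n$
output streams. The sketch below shows only that scheduling; the `words`
generator is a hypothetical stand-in for the stream of strings produced from
one start node, and dropping exhausted streams is a simplification added here
(the text does not say how finished queues are handled).

```python
from itertools import islice

def round_robin(generators):
    """Yield one item from each generator in turn; exhausted ones are dropped."""
    active = list(generators)
    while active:
        still_active = []
        for gen in active:
            try:
                item = next(gen)
            except StopIteration:
                continue
            yield item
            still_active.append(gen)
        active = still_active

def words(letter):
    """Hypothetical infinite output stream of one queue: a, aa, aaa, ..."""
    k = 1
    while True:
        yield letter * k
        k += 1

first_six = list(islice(round_robin([words("a"), words("b")]), 6))
print(first_six)
```

With two streams, the first stream's words appear on steps 1, 3, 5 and the
second stream's on steps 2, 4, 6, matching the step numbering described above
for $n = 2$.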

4 Notes on Implementation Details 

The described algorithms are being implemented in a practical inductive
inference system. Here we will briefly describe its architecture. For purposes
of practical implementation some deviations were made compared to the
theoretically "clean" algorithms.

The system has a separate module for grammar graph construction and a separate
module for graph compression. According to the theoretical algorithms these two
modules have to work sequentially, in the order in which they were just
mentioned. However, we have noticed that for some search spaces it is more
efficient to start compressing the graph before it is fully constructed. To be
able to organize the synthesis process in this manner we implemented a control
module that acts as a dispatcher between the graph constructor and the graph
compressor and that can be easily customized for different dispatching
strategies.

For real-world examples the domain node and production node sets become quite
large, and we have experimented with several strategies for compressing these
sets. In the compression process several domain nodes are merged into a single
node. When such compression takes place, the graph loses precision in the sense
that it encodes some strings that do not belong to the language. However, if
we add some more information to the grammar, e.g., additional input/output
examples, incorrect paths through the graph are filtered out. In the case of
input/output examples graph nodes can be compressed more safely if they have
values further from zero. For experimentation purposes we have implemented
a separate domain node writer module that can be easily changed to support
different node compression strategies.

A separate module implements the graph cleaning procedure. According to our
theoretical algorithms graphs are constructed and compressed in a bottom-up
manner, and language strings are output in a top-down manner. Compressed
graphs can be made smaller by traversing noncompressed graphs in a top-down
manner and registering which nodes are reachable; then during compression only
reachable nodes are considered. We call this process graph cleaning.
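
Graph cleaning as described is a reachability computation. A minimal sketch,
assuming the noncompressed graph is available as a successor map from each node
to the nodes it can lead to in a top-down derivation (the map, the function
name, and the node names are hypothetical):

```python
from collections import deque

def reachable_nodes(start_nodes, successors):
    """Top-down traversal returning every node reachable from the start nodes."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical graph: S uses A and x; B is never reached from S.
successors = {"S": ["A", "x"], "A": ["x"], "B": ["A", "y"]}
kept = reachable_nodes(["S"], successors)
print(sorted(kept))
```

Only the nodes in `kept` would then be passed on to the compression stage.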

At present, implementation of the inductive inference system is in progress.

5 Conclusion 

Successful computer experiments employing the methods developed in [2] have
been carried out. In these experiments algebraic expressions were synthesized
from input/output examples. In the most successful experiments the formula for
the volume of a frustum of a square pyramid and the formula for finding the
roots of a quadratic equation were synthesized in reasonable time. The
algorithms and the synthesis system described in this paper are general and
flexible enough to permit easier setup of computer experiments. We believe that
computer experiments can bring new insights into the inductive synthesis
process in our framework and help in obtaining new theoretical results.

Abstract. The present paper deals with inductive inference of recursive
functions, in general, and with the problem of validating inductive
learning devices, in particular.

Thus, the paper aims at a contribution to the research and develop- 
ment area of intelligent systems validation. As those systems are typi- 
cally interactive and, therefore, utilized in open loops of human-machine 
interactions, the problem of their validity is substantially complicated. 
A certain family of validation scenarios is adopted. Within this frame- 
work, we ask for the power and the limitations of these validation ap- 
proaches. The expertise necessary and sufficient to accomplish successful 
validation is of some particular interest. One of the key questions is for 
the comparison of domain expertise and validation expertise. 

The area of inductive inference of recursive functions is taken as a case 
for complex interactive systems validation. 

Computability theory is providing a rich source of theoretical concepts 
and results suitable for the focused investigations. Emphasis is put on 
explicating the importance of abstract computational complexity, lim- 
iting computability, and relativized computability. These concepts are 
exploited for characterizing the expertise necessary and sufficient in the 
validation of inductive inference systems. Particular emphasis is put on 
relating validation expertise and domain expertise by means of rela- 
tivized computability concepts. One of the key results on validation of 
inductive learning systems exhibits that validation expertise necessarily 
implies the expertise for solving the focused learning problems. 



1 Motivation 

The focus of the present paper is on inductive inference systems, but we draw
a particular motivation from another area: complex interactive systems
validation. There is an obvious necessity to validate and verify complex
systems, respectively. It might easily happen that "...the inability to
adequately evaluate systems may become the limiting factor in our ability to
employ systems that our technology and knowledge will allow us to design"
(cf. [12]).

Unfortunately, there are numerous severe accidents bearing abundant evidence
for the truly urgent need for complex systems validation. Besides spectacular
cases, daily experience with more or less invalid systems is providing
paramount illustrative examples. Progress in the area of validation and
verification of complex systems requires both disciplinary results and
solutions in the humanities including cognitive psychology, e.g. Even social
and political aspects come into play. The authors refrain from an in-depth
discussion.

Following [4] and [10], validation is distinguished from verification by the
illustrative circumscription of dealing with building the right system, whereas
verification deals with building the system right. The prototypical application
area considered in the present paper is systems validation, which, according
to the perspective cited above, is less constrained and less formalized than
verification.

Assume computer systems which are designed and implemented for an interactive
use to assist human beings in open loops of human-machine interactions
of a usually unforeseeable length. The validation task is substantially
complicated, if it is intermediately undecidable whether or not some
human-machine co-operation will eventually succeed.

Nontrivial learning problems, for instance, are quite typical representatives
of such a class of problems attacked through complex and usually time
consuming sequences of human-machine interactions. Knowledge discovery in data
bases, for instance, is a practically relevant application domain for those
learning approaches.

For assessing those systems' validity, validation scenarios of several types
have been proposed (cf. [9], e.g.). As soon as human experts are involved
in the implementation of validation scenarios, there arises the problem of the
experts' competence. An in-depth investigation of validation scenarios, of their
appropriateness for certain classes of target systems, and of their power and
limitations inevitably involves reasoning about the experts' competence.

Still informally speaking, the key question is how to characterize the human
expertise necessary or sufficient for validating certain AI systems.

The issue of human expertise is usually understood as a problem of cognitive
sciences (cf. [5]). This complicates a thorough computer science investigation
of validation scenarios mostly based on formal concepts and methodologies.

Therefore, the present paper focuses on approaches to characterize human
expertise in formal terms. This is deemed a substantial step towards a better
understanding of the power and limitations of interactive validation scenarios.



2 Learning Systems Validation - Basic Concepts 

We adopt validation scenarios according to [9], e.g. Validation is performed
through the essential stages of test case generation, experimentation,
evaluation, and assessment.

Test cases are generated in dependence on some intended target behaviour
and, possibly, with respect to peculiarities of the system under validation.
Interactive validation is performed by feeding test data into the system and,
hopefully, receiving the system's response. The results of such an
experimentation are subject to evaluation. The ultimate validity assessment is
synthesized upon the totality of evaluation outcomes.

Human experts who are invoked for learning systems validation within the
framework of those scenarios need to have some topical competence. It is one
of the key problems of validation approaches based on human expertise how to
characterize the experts' potentials which allow them to do their job
sufficiently well. Even more exciting, it is usually unknown whether or not the
humans engaged in those interactive scenarios can be replaced by computer
programs without any substantial loss of validation power. This problem is of
great philosophical interest and of tremendous practical importance.

For the validation of inductive inference systems, we will be able to
characterize the human expertise sufficient for trustable systems validation.
Some characterizations are even both sufficient and necessary. Thus, this paper
is understood as a contribution to the theory of inductive inference.



2.1 Preliminaries 

For most of the notions and notations of this section, [11] is a standard
reference. Let $\mathbb{N}$ denote the set of natural numbers, and let
$\mathbb{N}_\bot = \mathbb{N} \cup \{\bot\}$. For any $M \subseteq \mathbb{N}$
we denote the power set of $M$ by $\wp(M)$. For any $k \ge 1$, $P^k$ ($T^k$)
denotes the set of all partial (total) functions from $\mathbb{N}^k$ into
$\mathbb{N}$. For some function $f$, $dom(f)$ denotes the domain of $f$.

Computable functions are defined over $\mathbb{N}$. $\mathcal{P}$ is the class
of all partial recursive functions. The class of total recursive functions is
denoted by $\mathcal{R}$.

By $cod : \mathbb{N}^2 \to \mathbb{N}$ let us denote Cantor's pairing function,
i.e. a particularly simple primitive recursive function that is bijective
(injective and surjective).
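
For illustration, such a pairing function can be sketched in Python as follows.
The formula used is the standard Cantor pairing and is only assumed to agree
with $cod$; the names `cantor_pair` and `cantor_unpair` are hypothetical and
serve this sketch only.

```python
from math import isqrt


def cantor_pair(x: int, y: int) -> int:
    """Map a pair of natural numbers to a single natural number."""
    return (x + y) * (x + y + 1) // 2 + y


def cantor_unpair(z: int) -> tuple[int, int]:
    """Recover the unique pair (x, y) with cantor_pair(x, y) == z."""
    w = (isqrt(8 * z + 1) - 1) // 2   # index of the diagonal x + y = w
    y = z - w * (w + 1) // 2          # position of z on that diagonal
    return w - y, y
```

Since `cantor_unpair` inverts `cantor_pair` on every natural number, the
sketch exhibits both injectivity and surjectivity on any finite range one
cares to test.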

For a Gödel numbering $\varphi$, each number $j \in \mathbb{N}$ specifies a
particular function denoted by $\varphi_j$. For the rest of this paper, a Gödel
numbering $\varphi$ and a corresponding Blum complexity measure $\Phi$ are fixed
(cf. [3]). For any $j, x \in \mathbb{N}$, $\varphi_j(x)\downarrow$ indicates
that $\varphi_j(x)$ is defined. For some set $F \subseteq \mathcal{P}$, the
index set $I_F$ contains exactly all programs for functions from $F$, i.e.
$I_F = \{ i \in \mathbb{N} \mid \varphi_i \in F \}$.

Let $U \subseteq \mathcal{R}$. Then, $U$ is said to be enumerable provided
there is a $g \in \mathcal{R}$ such that
$U \subseteq \{ \varphi_{g(n)} \mid n \in \mathbb{N} \} \subseteq \mathcal{R}$.
If $U = \{ \varphi_{g(n)} \mid n \in \mathbb{N} \}$ for some
$g \in \mathcal{R}$, then $U$ is called exactly enumerable. By NUM (NUM!) we
denote the collection of all enumerable (exactly enumerable) subsets of
$\mathcal{R}$.

A sequence $(n_t)_{t \in \mathbb{N}}$ of natural numbers is said to converge to
some ultimately final value $n$, if past some point $t'$ all numbers $n_t$
($t > t'$) are identical to $n$. This so-called discrete limit is denoted by
$\lim (n_t)_{t \in \mathbb{N}} = n$.

A function $f$ is said to be limiting computable, if there is some
$g \in \mathcal{R}^2$ meeting (i) for all $x \in dom(f)$, there is some
$t' \in \mathbb{N}$ such that, for all $t \in \mathbb{N}$ with $t > t'$,
$g(x,t) = f(x)$, and meeting (ii) for all $x \notin dom(f)$ and all
$t \in \mathbb{N}$, there exists some $t' \in \mathbb{N}$ such that $t' > t$
and $g(x,t') \ne g(x,t)$.

Any $M \subseteq \mathbb{N}$ is said to be limiting decidable, if $M$'s
characteristic function $\chi_M$ is limiting computable. Similarly,
$M \subseteq \mathbb{N}$ is said to be limiting enumerable, if 'half' of $M$'s
characteristic function, $\chi_M^{+}$, is limiting computable, where
$\chi_M^{+}(x) = 1$ if and only if $x \in M$.
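
A minimal Python sketch may illustrate limiting decidability on the familiar
example of a halting predicate. It rests on assumptions that are not part of
the paper: programs are modelled as Python generator functions, every `yield`
counts as one computation step, and the names `halts_within`, `countdown`, and
`loop_forever` are hypothetical.

```python
from typing import Callable, Iterator


def halts_within(program: Callable[[int], Iterator[None]], x: int, t: int) -> int:
    """Guess g(x, t): 1 if `program` halts on x within t steps, else 0."""
    run = program(x)
    for _ in range(t):
        try:
            next(run)           # perform one computation step
        except StopIteration:
            return 1            # the program halted within the budget
    return 0


def countdown(x: int) -> Iterator[None]:
    """Toy program that halts after exactly x steps."""
    for _ in range(x):
        yield


def loop_forever(x: int) -> Iterator[None]:
    """Toy program that never halts."""
    while True:
        yield
```

For every fixed input, the guesses `halts_within(program, x, t)` change at most
once as `t` grows, so they stabilize on the correct answer, although no finite
stage reveals that stabilization has already happened.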

We use the abbreviation to indicate that $f$ is computable relative to some
oracle $A$, i.e. there is an algorithm computing $f$ that is allowed to ask,
from time to time, questions of the type '$n \in A$?', and that may use the
answers supplied to determine how to continue.

We use the abbreviation $[M \text{ l.r.e.}]^A$ to indicate that there is some
function computable relative to some oracle $A$ ($A$-computable, for short)
limiting enumerating some set $M \subseteq \mathbb{N}$.



2.2 Inductive Inference Notions and Notations 

Induction constitutes an important feature of learning. The corresponding theory
is called inductive inference. Inductive inference may be characterized as the
study of systems that map evidence on a target concept into hypotheses about it.
The investigation of scenarios in which the sequence of hypotheses stabilizes to
an accurate and finite description of the target concept is of some particular
interest. The precise definitions of the notions evidence, stabilization, and
accuracy go back to Gold (cf. [6]) who introduced the model of learning in the
limit.

This section is focused on essential features of inductive learning which
complicate the validation task, and it introduces a few basic formalisms. For
both conceptual simplicity and expressive generality, the focus of the present
investigations is on learning of total recursive functions from finite sets of
input/output examples (cf. [2]).

When learning any total recursive function $f$, the input/output examples
$(0,f(0)), (1,f(1)), (2,f(2)), \ldots$ are subsequently presented. Learning
devices are computable procedures generating hypotheses upon natural numbers
$f[t]$ encoding finite samples $(0,f(0)), (1,f(1)), \ldots, (t,f(t))$. Note
that, for every $x \in \mathbb{N}$, there is the one and only finite sample
encoded by $x$.
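
One way to obtain such a one-to-one coding of finite samples can be sketched in
Python. This is an assumption made for illustration, not necessarily the coding
used in the paper: the sample length is paired with an iterated Cantor pairing
of the values, and all function names are hypothetical.

```python
from math import isqrt


def pair(x: int, y: int) -> int:
    """Standard Cantor pairing of two natural numbers."""
    return (x + y) * (x + y + 1) // 2 + y


def unpair(z: int) -> tuple[int, int]:
    """Inverse of `pair`."""
    w = (isqrt(8 * z + 1) - 1) // 2
    y = z - w * (w + 1) // 2
    return w - y, y


def encode_sample(values: tuple[int, ...]) -> int:
    """Code the sample (0, f(0)), ..., (t, f(t)), given as (f(0), ..., f(t))."""
    code = values[0]
    for v in values[1:]:
        code = pair(code, v)              # fold the values from left to right
    return pair(len(values) - 1, code)    # store t next to the folded values


def decode_sample(x: int) -> tuple[int, ...]:
    """Recover the unique sample (f(0), ..., f(t)) coded by x."""
    t, code = unpair(x)
    values = []
    for _ in range(t):
        code, v = unpair(code)            # peel off the last value
        values.append(v)
    values.append(code)                   # what remains is f(0)
    return tuple(reversed(values))
```

Under this coding every natural number decodes to exactly one non-empty sample,
and encoding that sample returns the same number.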

For notational convenience, hypotheses are just natural numbers which are
to be interpreted via the underlying Gödel numbering $\varphi$.

Note that learning will usually take place over time. Thus, hypotheses are 
generated subsequently. 

An individual learning problem is always understood to be a class of target
functions. A corresponding learning device has to learn each of these functions
individually when fed with appropriate samples.

Definition 1 (LIM). $U \in$ LIM iff there is an $S \in \mathcal{P}$ satisfying
for any $f \in U$: (1) for all $t \in \mathbb{N}$, $h_t = S(f[t])$ is defined
and (2) $\lim (h_t)_{t \in \mathbb{N}} = h$ with $\varphi_h = f$ exists.
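
To make the definition concrete, the following Python sketch shows the
classical identification-by-enumeration strategy on a small finite family of
total functions. It is an illustration under simplifying assumptions:
hypotheses are positions in a Python list rather than $\varphi$-indices,
samples are passed as tuples of values instead of coded numbers, and the names
`enumeration_learner` and `hypotheses` are hypothetical.

```python
from typing import Callable, Sequence


def enumeration_learner(family: Sequence[Callable[[int], int]]):
    """Return a learner S that outputs the first member consistent with a sample."""

    def S(sample: Sequence[int]) -> int:
        for h, candidate in enumerate(family):
            if all(candidate(x) == y for x, y in enumerate(sample)):
                return h
        return 0  # arbitrary default if nothing in the family fits

    return S


def hypotheses(S, f: Callable[[int], int], horizon: int) -> list[int]:
    """List h_t = S(f[t]) for t = 0, ..., horizon - 1."""
    return [S(tuple(f(x) for x in range(t + 1))) for t in range(horizon)]
```

For the family of functions mapping $x$ to $c \cdot x$ with $c < 5$ and the
target $x \mapsto 3x$, the learner first guesses position 0, since all members
agree at $x = 0$, and then stays at position 3 forever. The sequence of
hypotheses converges, but no single hypothesis announces that it is final.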

Thus, LIM is a collection of function classes $U$ for which some recursive
learning device $S$ as indicated exists. As usual, by $LIM(S)$ we denote the
function class learned by $S$. For some $\mathcal{S} \subseteq \mathcal{P}$, we
set $LIM(\mathcal{S}) = \{ LIM(S) \mid S \in \mathcal{S} \}$. If the
learning device $S$ exclusively outputs indices for total recursive functions
then $U$ belongs to the learning type TOTAL.

Definition 2 (TOTAL). $U \in$ TOTAL iff there exists some $S \in \mathcal{P}$
satisfying for any $f \in U$: (1) for all $t \in \mathbb{N}$, $h_t = S(f[t])$
is defined, (2) $\lim (h_t)_{t \in \mathbb{N}} = h$ with $\varphi_h = f$
exists, and (3) for all $t \in \mathbb{N}$, $h_t \in I_{\mathcal{R}}$.

Alternatively, if it is decidable whether or not $S$, when learning any
$f \in U$, has reached the ultimate learning goal then $S$ witnesses that $U$
belongs to the special learning type FIN. This approach is easily formalized as
well:

Definition 3 (FIN). $U \in$ FIN iff there exist some $S \in \mathcal{P}$ and
some related decision procedure $d \in \mathcal{P}$ which satisfy for any
$f \in U$: (1) for all $t \in \mathbb{N}$, $h_t = S(f[t])$ is defined,
(2) $\lim (h_t)_{t \in \mathbb{N}} = h$ with $\varphi_h = f$ exists, (3) for
all $t \in \mathbb{N}$, $d(f[t])$ is defined, and (4) for all
$t \in \mathbb{N}$, $d(f[t]) = 1$ iff $S(f[t]) = h$.

The relation between the learning types introduced above is as follows:

FIN $\subset$ TOTAL $\subset$ LIM $\subset \wp(\mathcal{R})$.

To sum up, although inductive learning succeeds after finitely many steps, in
its right perspective, it is appropriately understood as a limiting process.
This fact causes unavoidable difficulties for validation attempts based on
local information only.

2.3 Interactive Scenarios for Learning Systems Validation 

A validation problem for inductive inference systems is given as a triple of
(1) some function class $U \subseteq \mathcal{R}$, (2) some learning function
$S \in \mathcal{P}$, and (3) an inductive inference type like LIM, TOTAL, or
FIN, e.g. The precise question is whether $S$ is able to learn all functions
$f$ from $U$ with respect to the considered inductive inference type.

There are two substantial difficulties. First, function classes $U$ under
consideration are usually infinite. Second, every individual function is an
infinite object in its own right. In contrast, every human attempt to validate
some learning system by a series of systematic tests is essentially finite.
Thus, validity statements are necessarily approximate.

When some process of (hopefully) learning some target function $f \in U$ by
some device $S \in \mathcal{P}$ with respect to some inductive inference type
is in progress, one may inspect snapshots determined by any point $t$ in time.

Any pair of an index of a recursive function and a time point is called test
data. Those pairs represent initial segments of functions. Certain data are
chosen for testing by a test data selection.

Definition 4 (Test Data, Test Data Selection). Any pair $(j,t)$ that meets
$\varphi_j(x)\downarrow$, for all $x \le t$, is called test data. $TD$ denotes
the set of all potential test data. Furthermore, a function
$Ds : \mathbb{N} \to \wp(TD)$ defines a test data selection provided that, for
all $n \in \mathbb{N}$, $Ds(n) \subseteq Ds(n+1)$.

In practice, the selection of test data is frequently done by hand. So, there is
no need to consider the test data selection to be recursive.

Intuitively, the two numbers refer to a program and an intensity, with which
the behaviour of the system is tested for this program. Test data $(j,t)$ are
interpreted as $\varphi_j[t]$. Therefore, the second parameter is called a time
stamp.

In order to verify whether or not a learning system is valid with respect to
some function class $U$, enough relevant test data have to be selected.

Definition 5 (Completeness). Let $U \subseteq \mathcal{R}$ and let $Ds$ be a
test data selection. $Ds$ is said to be complete for $U$ iff the set
$T = \bigcup_{n \in \mathbb{N}} Ds(n)$ satisfies the conditions
(1) for all $f \in U$, there is a $\varphi$-index $j$ for $f$ such that
$(j,t) \in T$ for every $t \in \mathbb{N}$,
(2) there are only finitely many test data $(j,t) \in T$ with $j \notin I_U$,
and (3) there are only finitely many test data $(j,t) \in T$ with
$(j,t+1) \notin T$.

When testing some learning system $S$ on test data $(j,t)$, one is interested
in knowing how $S$ behaves on input $\varphi_j[t]$. Experimentation means
feeding test data to the system under investigation and, if possible, receiving
the system's response.

Definition 6 (Experimentation). Any total mapping
$Expmt : \mathbb{N} \to \mathbb{N}_\bot$ is said to be an experimentation. The
mapping $Expmt$ is an experimentation for $S \in \mathcal{P}$ iff, for all
$(j,t) \in TD$, either $Expmt(\varphi_j[t]) = \bot$ or
$Expmt(\varphi_j[t]) = S(\varphi_j[t])$.

Because experimentation is a human activity, the mapping Expmt is not 
necessarily computable. 

Intuitively, the result $\bot$ means that no proper system's response has been
received. This may be due to some time out, e.g. Clearly, if it frequently
happens that $Expmt(\varphi_j[t]) = \bot$, but $S(\varphi_j[t])$ terminates,
then this particular experimentation does not reflect the learning system's
behaviour sufficiently well.

Insistency characterizes a manner of interactively validating a system where
the human interrogator never gives up too early.

Definition 7 (Insistency). Let $Expmt$ be an experimentation for
$S \in \mathcal{P}$. $Expmt$ is said to be insistent for $S$ iff
$Expmt(\varphi_j[t]) = S(\varphi_j[t])$ for exactly all $(j,t) \in TD$, where
$S(\varphi_j[t])\downarrow$.
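
The difference between an arbitrary experimentation for $S$ and an insistent
one can be checked on a finite set of inputs with a small Python sketch. It is
an approximation by construction, since both definitions quantify over all test
data. The sketch assumes that `None` stands for $\bot$ and also for an
undefined value of the learner, that samples are tuples of values, and that the
names `is_experimentation_for` and `is_insistent_for` are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

Sample = Tuple[int, ...]
Outcome = Optional[int]   # None plays the role of "no response"


def is_experimentation_for(expmt: Callable[[Sample], Outcome],
                           learner: Callable[[Sample], Outcome],
                           samples: Iterable[Sample]) -> bool:
    """Every recorded outcome is either 'no response' or the learner's output."""
    return all(expmt(x) is None or expmt(x) == learner(x) for x in samples)


def is_insistent_for(expmt: Callable[[Sample], Outcome],
                     learner: Callable[[Sample], Outcome],
                     samples: Iterable[Sample]) -> bool:
    """The outcome equals the learner's output wherever the learner answers."""
    return all(expmt(x) == learner(x) for x in samples)
```

An experimenter who gives up on every sample longer than two examples still
performs an experimentation for the learner, but not an insistent one.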

These formalisms are aimed at the description of an expert's interactive
validation of any given learning system $S$. An expert is performing experiments
with some target object $\varphi_j$ in mind resulting in protocols. A protocol
is a triple $(j,t,h)$ with $(j,t) \in TD$ and $h = Expmt(\varphi_j[t])$.

Those protocols are subject to the expert's evaluation marked 1 or 0,
respectively, expressing the opinion whether or not the experiment witnesses the
system's ability to learn the target function $\varphi_j$. This realizes a
certain mapping $Eval : TD \times \mathbb{N} \to \{0,1\}$, a so-called expert's
evaluation function. As before, this might not be computable. The tuple
consisting of a protocol and the expert's evaluation is a report.

Validation statements are synthesized upon reports which reflect interactive
systems validation to some extent. In dependence on the underlying validation
scenario, there are concepts of different sophistication. We adopt the simplest
approach, and consider any finite set of reports to be a validation statement.

For interactive systems, in general, and for learning systems, in particular,
any one-shot validation does not seem to be appropriate. Thus, one is led to
validation scenarios in open loops which result in sequences of validation
statements. Hence, a validation dialogue arises, constituted by any test data
selection, experimentation, and the expert's evaluation function.

Definition 8 (Validation Dialogue). Assume any test data selection $Ds$, any
experimentation $Expmt$, and any expert's evaluation function $Eval$. The triple
$VD = (Ds, Expmt, Eval)$ defines a sequence of validation statements
$(VS_n)_{n \in \mathbb{N}}$ called a validation dialogue, where, for all
$n \in \mathbb{N}$, $VS_n$ is the collection of all reports $((j,t,h),b)$ with
$(j,t) \in Ds(n)$, $h = Expmt(\varphi_j[t])$, and $b = Eval(j,t,h)$.
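
The construction of the validation statements can be sketched in Python as
follows. The sketch makes simplifying assumptions: `expmt` receives the test
data $(j,t)$ directly instead of the coded sample $\varphi_j[t]$, all three
components are ordinary Python callables, and the function names are
hypothetical.

```python
from typing import Callable, List, Optional, Set, Tuple

TestData = Tuple[int, int]                        # (j, t)
Report = Tuple[Tuple[int, int, Optional[int]], int]


def validation_statement(selected: Set[TestData],
                         expmt: Callable[[int, int], Optional[int]],
                         evaluate: Callable[[int, int, Optional[int]], int]
                         ) -> Set[Report]:
    """Collect all reports ((j, t, h), b) for the selected test data."""
    reports = set()
    for (j, t) in selected:
        h = expmt(j, t)                  # protocol (j, t, h)
        b = evaluate(j, t, h)            # the expert's mark, 0 or 1
        reports.add(((j, t, h), b))
    return reports


def validation_dialogue(ds: Callable[[int], Set[TestData]],
                        expmt: Callable[[int, int], Optional[int]],
                        evaluate: Callable[[int, int, Optional[int]], int],
                        rounds: int) -> List[Set[Report]]:
    """Return the finite prefix VS_0, ..., VS_{rounds-1} of the dialogue."""
    return [validation_statement(ds(n), expmt, evaluate) for n in range(rounds)]
```

Only a finite prefix of the dialogue can ever be materialized, which is why
success of a dialogue is phrased as a property of the whole sequence.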

Such a validation dialogue is said to be successful for $U$ and $S$ if and only
if the underlying data selection is complete for $U$, the experimentation is
insistent for $S$, and the expert's evaluation is converging to the success
value 1, for every program which is subject to unbounded experimentation.

Definition 9 (Successful Validation Dialogue). Let $U \subseteq \mathcal{R}$
and $S \in \mathcal{P}$. Furthermore, assume any test data selection $Ds$, any
experimentation $Expmt$, and any expert's evaluation function $Eval$. The
validation dialogue $(VS_n)_{n \in \mathbb{N}}$ defined by
$VD = (Ds, Expmt, Eval)$ is successful for $U$ and $S$ iff (1) $Ds$ is complete
for $U$, (2) $Expmt$ is insistent for $S$, and (3) for every
$j \in \mathbb{N}$, there are only finitely many reports
$((j,t,h),b) \in \bigcup_{n \in \mathbb{N}} VS_n$ with $b = 0$.

The formal concepts introduced will suffice for systematically investigating
the possibilities of interactive learning systems validation.

3 Learning Systems Validation - Results 

Within the preceding section, a formalization of a general scenario for learning
systems validation has been presented. It mainly consists of the following three
phases: test case generation, experimentation, and evaluation. Based on this
formalization, we are systematically addressing the following questions
separately for each phase of the validation process:

— Which level of expertise is necessary and sufficient to supervise the corre- 
sponding phase of the validation process? 

— For which problem classes can a module be implemented that realizes the
required functionality?

— What are relevant criteria to measure the effort needed to supervise resp. 
automate parts of the validation process? Do those criteria allow for a strat- 
ification of problem classes? 



3.1 Test Data Selection 

We go only very briefly into the details of test data selection. There are
several areas of more traditional computer science and of AI where the
generation of test cases or test sets plays an important role. The
methodologies invoked range from sophisticated mathematical considerations to
comprehensive investigations taking aspects of cognitive psychology into
account. We are aware of the narrowness of our present approach, but we had to
trade generality for precision.

First, we introduce a notion for the collection of all those learning problems
for which some complete data selection exists.

Definition 10 (CDS). Let $U \subseteq \mathcal{R}$. $U \in$ CDS iff there is a
test data selection $Ds$ which is complete for $U$.

Let $A$ be an oracle and let $U \subseteq \mathcal{R}$. We use the notation
$[U \in \text{CDS}]^A$ to indicate that there is an $A$-computable data
selection $Ds$ which is complete for $U$.

In most investigations, it is of particular interest to find out whether or not
sets of test cases can be generated automatically. Within the technical terms of
the present approach, this is the question for the computability and relativized
computability of the data selection function $Ds$, respectively.

Theorem 11. Let $U \subseteq \mathcal{R}$. Then, for any oracle $A$,
$[U \in \text{CDS}]^A \iff [U \in \text{NUM!}]^A$.

Theorem 11 yields the following corollary which exhibits the restrictiveness
of areas in which the selection of relevant test data can be fully automated.

Corollary 12. CDS = NUM! 
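
The easier direction of this correspondence can be made tangible with a short
Python sketch. Given a function `g` whose values are assumed to enumerate
exactly the programs of a class $U$ of total functions, the selection
$Ds(n) = \{ (g(i),t) \mid i,t \le n \}$ is monotone and meets conditions (1)
to (3) of Definition 5. This is one natural construction offered for
illustration, not necessarily the one used in the proof of Theorem 11; the name
`selection_from_enumeration` is hypothetical.

```python
from typing import Callable, Set, Tuple


def selection_from_enumeration(g: Callable[[int], int],
                               n: int) -> Set[Tuple[int, int]]:
    """Ds(n): the first n + 1 enumerated programs, each with time stamps 0..n."""
    return {(g(i), t) for i in range(n + 1) for t in range(n + 1)}
```

Every enumerated program eventually appears with every time stamp, no program
outside the enumeration is ever selected, and no pair $(j,t)$ remains in the
union without its successor $(j,t+1)$.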

Finally, let us point to the following result stating that there is no level of
expertise that allows one to generate complete test data for all learning
problems.

Theorem 13. For any oracle $A$, there is some $U \in$ LIM such that
$[U \notin \text{CDS}]^A$.



3.2 Experimentation 

Recall that an experimentation is called insistent exactly if it always
guarantees to take system outputs into account, in case those are eventually
generated.

The question considered in the present section is how to implement any form
of control to accomplish insistent experimentation. Conceptually, one needs a
module supervising experimentation and 'telling' the validator whether or not
(s)he should wait a little longer for some system's response.

Any given learning function $S$ under validation is effectively computable and,
therefore, when being subject to experimentation, may be understood as some
particular $\varphi_s$ with $\varphi_s = S$. Note that this does not mean that
the validator is necessarily aware of the particular program $s$ under
inspection. However, the actual experimentation process is characterized by the
computation time of $\varphi_s$ which can be suitably formalized by any related
Blum complexity measure.

In order to formalize control concepts of insistent experimentation, it is
necessary to distinguish between so-called 'white box' validation and 'black
box' validation (cf. [8], e.g.). In the first case, one has access to the
program under validation, whereas one is restricted to only the program's
behaviour in the latter case. This is formally reflected by a control function
$c$ which depends either on both the information $\varphi_j[t]$ presented and
the program $s$ inspected or exclusively on the recent information
$\varphi_j[t]$.

Definition 14 (Control). Let $c \in T^2$ and let $s \in \mathbb{N}$. Then,
control $c$ allows for an insistent white box experimentation with
$\varphi$-program $s$ iff, for any $\varphi_j[t] \in \mathbb{N}$,
$\varphi_s(\varphi_j[t])\downarrow$ implies
$c(\varphi_j[t], s) > \Phi_s(\varphi_j[t])$.

Let $c \in T^1$ and let $s \in \mathbb{N}$. Then, control $c$ allows for an
insistent black box experimentation with $\varphi$-program $s$ iff, for any
$\varphi_j[t] \in \mathbb{N}$, $\varphi_s(\varphi_j[t])\downarrow$ implies
$c(\varphi_j[t]) > \Phi_s(\varphi_j[t])$.

Let $c \in T^2$. Then, $COP^w(c)$ is the set of all $\varphi$-programs
controlled by $c$, accordingly. Furthermore, $COF^w(c)$ is the set of all
computable functions for which there is a $\varphi$-program controlled by $c$,
i.e. $COF^w(c) = \{ \varphi_s \mid s \in COP^w(c) \}$. Concerning insistent
black box experimentation via some control $c \in T^1$, the sets $COP^b(c)$ and
$COF^b(c)$ are defined analogously.
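
The role of a control function can be illustrated by a step-bounded experiment
in Python. The sketch relies on assumptions that are not part of the paper:
the program under validation is modelled as a generator function, each `yield`
counts as one unit of the complexity measure, the control only sees the sample
(the black box case), and the names `run_with_budget`, `allows_insistent`, and
`slow_learner` are hypothetical.

```python
from typing import Callable, Generator, Iterable, Optional, Tuple

Sample = Tuple[int, ...]
Program = Callable[[Sample], Generator[None, None, int]]


def run_with_budget(program: Program, sample: Sample, budget: int) -> Optional[int]:
    """Return the program's output if its step count is below `budget`, else None."""
    run = program(sample)
    steps = 0
    while steps < budget:
        try:
            next(run)                   # one computation step
        except StopIteration as stop:
            return stop.value           # halted after `steps` steps, steps < budget
        steps += 1
    return None                         # the validator stops waiting


def allows_insistent(control: Callable[[Sample], int], program: Program,
                     samples: Iterable[Sample]) -> bool:
    """Finite check, assuming the program halts on every listed sample."""
    return all(run_with_budget(program, x, control(x)) is not None for x in samples)


def slow_learner(sample: Sample) -> Generator[None, None, int]:
    """Toy learner that needs one step per example and outputs the last value."""
    for _ in sample:
        yield
    return sample[-1]
```

A control that grants `len(sample) + 1` steps never cuts this toy learner off,
whereas a constant budget of 3 loses every answer on samples of length 3 or
more.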

The following lemma justifies the focus of the investigations on control
functions. As it turns out, insistent experimentation can be replaced by control
functions. So from now on, instead of investigating some abstract
experimentation, insistent control functions will be analyzed.

Lemma 15. For any learning device $S$, the following statements are equivalent:

(1) There exists an insistent experimentation for $S$.

(2) There is a $c \in T^2$ with $S \in COF^w(c)$.

(3) There is a $c \in T^1$ with $S \in COF^b(c)$.

Subsequently, we study to what extent insistent experimentation can be im- 
plemented. 

It is well known that there are arbitrarily complex programs. In other words,
for each recursive bound, there are infinitely many total recursive functions
that have a program which exceeds this bound (cf. [3]). Thus, one may expect
that insistent experimentation for larger classes of inductive learning devices
requires some non-recursive expertise.

A prominent example for non-recursive expertise is the halting set
$H = \{ (i,x) \mid i,x \in \mathbb{N}, \varphi_i(x)\downarrow \}$. As we will
see, the halting set $H$ exactly characterizes the level of non-recursive
expertise which is both necessary and sufficient to accomplish white box
experimentation for all learning devices. In order to verify the correctness of
this statement, some additional notation is needed.

Let $A$ be an oracle, $Q \subseteq \mathbb{N}$, and
$\mathcal{S} \subseteq \mathcal{P}$. Then, the notation $[Q \in COP^w]^A$ and
$[\mathcal{S} \in COF^w]^A$ indicates that there is an $A$-computable control
$c$ allowing for an insistent white box experimentation with all programs in
$Q$ and all learning devices in $\mathcal{S}$, respectively. We adopt these
notations for black box experimentation.

Theorem 16. Let $A$ be any oracle. Then,
$[I_{\mathcal{P}} \in COP^w]^A \iff [H \text{ is recursive}]^A$.

In the black box approach, the situation changes drastically. In this setting,
the control $c$ does not receive any information about the program it is
supposed to control. Thus, the result from [3] already mentioned at the
beginning of this subsection immediately allows for the following insight.

Theorem 17. There is no oracle $A$ such that $[\mathcal{P} \in COP^b]^A$.

From the above result, we may easily conclude that, in the black box setting,
the analogue of Theorem 16 does not hold.

Corollary 18. There is no oracle $A$ such that $[I_{\mathcal{P}} \in COP^b]^A$.

Next, we investigate the power and limitations of computable control. As we 
shall see, it is still impossible to control all programs for any given total recursive 
learning device. 

Corollary 19. Let $S \in \mathcal{R}$ be any learning device and let
$c \in \mathcal{R}$ be any control. Then, there is a program $s$ for $S$ with
$s \notin COP^b(c)$.

In the white box case, the same insight can be achieved. However, the justi- 
fication is a little bit more complex. 

Theorem 20. Let $S \in \mathcal{R}$ be any learning device and let
$c \in \mathcal{R}^2$ be any control. Then, there is a program $s$ for $S$ with
$s \notin COP^w(c)$.

Naturally, the question arises of how to characterize the class of programs a
single control can handle. The next result gives an answer to this question.

Corollary 21. Let $c \in \mathcal{R}^2$ and let $\mathcal{S} = COF^w(c)$. Then,
there is some $\mathcal{S}' \in$ NUM such that
$LIM(\mathcal{S}) \subseteq LIM(\mathcal{S}')$.

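Proof. Let $\mathcal{S}$ and $c$ be given. First, we define an appropriate class
of learning devices $\mathcal{S}' \in$ NUM. Let $s \in \mathbb{N}$. For all
$f[n] \in \mathbb{N}$, we set:

$$\psi_s(f[n]) = \begin{cases} \varphi_s(f[n]), & \text{if } (s,f[n]) \text{ is } c\text{-bounded}, \\ n, & \text{otherwise.} \end{cases}$$

A pair $(s,f[n])$ is said to be $c$-bounded if $\Phi_s(f[x]) < c(f[x],s)$ for
all $x \le n$.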

Clearly, the function class $\mathcal{S}' = \{ \psi_s \mid s \in \mathbb{N} \}$
is enumerable. Thus, it remains to verify that
$LIM(\mathcal{S}) \subseteq LIM(\mathcal{S}')$. This can be done as follows.

Let $s$ be any program controlled by $c$. Clearly, it suffices to verify that

$LIM(\varphi_s) = LIM(\psi_s)$.

First, let $f \in LIM(\varphi_s)$. Therefore, for all $n \in \mathbb{N}$, the
pair $(s,f[n])$ is $c$-bounded, and thus $\psi_s(f[n]) = \varphi_s(f[n])$.
Hence, $f \in LIM(\psi_s)$, too.

Second, let $g \in LIM(\psi_s)$. Then, we know that the sequence
$(\psi_s(g[n]))_{n \in \mathbb{N}}$ of hypotheses generated by $\psi_s$, when
successively fed information about $g$, must converge. By definition of
$\psi_s$, this implies that, for all $n \in \mathbb{N}$, the pair $(s,g[n])$
has to be $c$-bounded. Therefore, $\varphi_s(g[n]) = \psi_s(g[n])$ for all
$n \in \mathbb{N}$, and, since $g \in LIM(\psi_s)$, $g \in LIM(\varphi_s)$ as
well. □
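
The idea behind the bounded devices $\psi_s$ can be sketched in Python. The
sketch is an illustration under explicit assumptions: programs are generator
functions whose `yield`s count as steps, the control sees only the sample since
the program is fixed, the fallback value `n` for pairs that are not $c$-bounded
is an assumption chosen so that the output keeps changing, and all names are
hypothetical.

```python
from typing import Callable, Generator, Optional, Tuple

Sample = Tuple[int, ...]
Program = Callable[[Sample], Generator[None, None, int]]


def run_with_budget(program: Program, sample: Sample, budget: int) -> Optional[int]:
    """Return the output if the program halts in fewer than `budget` steps."""
    run = program(sample)
    steps = 0
    while steps < budget:
        try:
            next(run)
        except StopIteration as stop:
            return stop.value
        steps += 1
    return None


def bounded_device(program: Program, control: Callable[[Sample], int],
                   sample: Sample) -> int:
    """Total device: follow `program` while all prefixes respect the control."""
    n = len(sample) - 1
    for x in range(n + 1):
        prefix = sample[: x + 1]
        if run_with_budget(program, prefix, control(prefix)) is None:
            return n        # assumed fallback: a value that changes with n
    return run_with_budget(program, sample, control(sample))


def slow_learner(sample: Sample) -> Generator[None, None, int]:
    """Toy learner: one step per example, outputs the last value seen."""
    for _ in sample:
        yield
    return sample[-1]
```

If the control is respected on all prefixes, the bounded device reproduces the
original hypotheses. Once the bound is violated, it is violated on all longer
samples as well, so the fallback values never stabilize, which mirrors the
second half of the proof.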



3.3 Evaluation Expertise 

Within the last two subsections, we have investigated the problem of automating
the data selection and the experimentation. Now, we focus our attention on
the evaluation phase. The next definition provides the formal framework for an
appropriate investigation.

Definition 22 (EVAL). Let ID be an identification type, let
$\mathcal{S} \subseteq \mathcal{P}$ be a collection of learning devices, and let
$\mathcal{U} \subseteq$ ID be a collection of learning problems. Then, the
learning devices in $\mathcal{S}$ are said to be ID-evaluable with respect to
all learning problems in $\mathcal{U}$
($(\text{ID}, \mathcal{S}, \mathcal{U}) \in$ EVAL, for short) iff there is an
evaluation $Eval$ such that, for all $S \in \mathcal{S}$, all
$U \in \mathcal{U}$, all data selections $Ds$ complete for $U$, and all
experimentations $Expmt$ insistent for $S$, the resulting validation dialogue
$VD = (Ds, Expmt, Eval)$ is successful for $U$ and $S$ iff
$U \subseteq \text{ID}(S)$.

In case that there is an evaluation $Eval$ that witnesses that the overall
collection of all learning devices $\mathcal{P}$ is ID-evaluable with respect to
all possible learning problems in ID, we use the shorthand ID $\in$ EVAL instead
of $(\text{ID}, \mathcal{P}, \text{ID}) \in$ EVAL.

Let $A$ be any oracle. Then, if there is an $A$-computable evaluation $Eval$
witnessing $(\text{ID}, \mathcal{S}, \mathcal{U}) \in$ EVAL for an
identification type ID, some class of learning devices $\mathcal{S}$, and some
collection $\mathcal{U}$ of learning problems, this is expressed as
$[(\text{ID}, \mathcal{S}, \mathcal{U}) \in \text{EVAL}]^A$.

The key question considered next is, given some learning type ID, how powerful
an expert must be to ID-evaluate all computable learning devices with
respect to all learning problems in ID.

Formal concepts can be invoked to implement the following program:

(i) Choose the requirements an acceptable learning device should meet. For
instance, it should learn in the limit or should finitely learn.

(ii) Find some characterization of expertise.

(iii) Prove a theorem that any expert who is competent according to the
conditions of (ii) is, therefore, able to truly evaluate whether or not any
given learning device meets the requirements focused under (i) when being
confronted with such a learning problem.

(iv) Prove a theorem that an expert's ability to evaluate all devices with
respect to the requirements fixed within (i) necessarily needs some skill as
formalized within (ii).

The problem of determining in the limit, whether or not an arbitrary com- 
putable function is total, seems to play a key role in validating learning systems. 

Proposition 23. Let $A$ be any oracle. Then,
$[\text{LIM} \in \text{EVAL}]^A \iff [I_{\mathcal{R}} \text{ l.r.e.}]^A$.

Proof. We start with the following claim.

Claim 1. $[I_{\mathcal{R}} \text{ l.r.e.}]^A \Rightarrow [\text{LIM} \in \text{EVAL}]^A$.

Let $S$ be any learner and let $U$ be any learning problem in LIM. Furthermore,
we assume an insistent experimentation $Expmt$ for $S$ and a complete data
selection $Ds$ for $U$.

In the sequel, we are going to illustrate the way in which an expert might
exploit the assumed expertise for systematic validation. Let $d$ be a function
that limiting enumerates $I_{\mathcal{R}}$. For each $n \in \mathbb{N}$, let
$P_n$ be the set of all protocols for $Ds(n)$, that is
$P_n = \{ (j,t,h) \mid (j,t) \in Ds(n), Expmt(\varphi_j[t]) = h \}$.

For any protocol $(j,t,h) \in P_n$, let $P_n(j,t)$ denote the set of protocols
for program $j$ having a time stamp smaller than $t$, i.e.
$P_n(j,t) = \{ (j,t',h') \in P_n \mid t' < t \}$.

Furthermore, we say that $P_n(j,t)$ is nice, provided that $P_n(j,t)$ is
non-empty and $h = \hat{h}$, where $\hat{h}$ is the hypothesis documented in the
protocol with the maximal time stamp in $P_n(j,t)$.

The evaluation of any particular protocol $p = (j,t,h) \in P_n$ is done as follows. We distinguish the following cases:

(i) If (A), (B), and (C) are satisfied, then set $Eval((j,t,h)) = 1$.

(A) For all $(j,t',h') \in P_n(j,t)$, $h'$ is defined.

(B) $P_n(j,t)$ is nice.

(C) $d(h,t) = 1$ and, for all $x < t$, $\Phi_h(x) < t$ implies $\varphi_h(x) = \varphi_j(x)$.

(ii) Otherwise, set $Eval((j,t,h)) = 0$.
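
The case distinction above can be sketched in Python. This is only an illustration under stated assumptions: `d`, `phi`, and `steps` are hypothetical callables standing in for the limiting enumerator $d$, the numbering $\varphi$, and the complexity measure $\Phi$; condition (A) is read as "a hypothesis was returned" (encoded as a value other than `None`); the toy programs are invented for the example.

```python
def evaluate(report, protocols, d, phi, steps):
    """Return 1 iff conditions (A), (B) and (C) hold for the report (j, t, h)."""
    j, t, h = report
    # P_n(j, t): earlier reports for program j, as (time stamp, hypothesis) pairs
    earlier = sorted(((t2, h2) for (j2, t2, h2) in protocols
                      if j2 == j and t2 < t), key=lambda r: r[0])
    cond_a = all(h2 is not None for _, h2 in earlier)   # assumed reading of (A)
    cond_b = bool(earlier) and earlier[-1][1] == h      # (B): P_n(j, t) is nice
    cond_c = (h is not None and d(h, t) == 1 and        # (C): bounded agreement test
              all(phi(h, x) == phi(j, x)
                  for x in range(t) if steps(h, x) < t))
    return 1 if (cond_a and cond_b and cond_c) else 0

# Toy setting (all hypothetical): two total programs, unit running time,
# and an enumerator d that already answers 1 for every index.
PROGRAMS = {0: lambda x: x % 2, 1: lambda x: 0}
phi = lambda i, x: PROGRAMS[i](x)
steps = lambda i, x: 1
d = lambda h, t: 1

good = [(0, 0, 1), (0, 1, 0), (0, 2, 0), (0, 3, 0)]   # learner settles on index 0
bad = [(0, 0, 1), (0, 1, 1), (0, 2, 1)]               # learner sticks to wrong index 1
```

In the toy run, the report with the latest time stamp in `good` is evaluated with 1, whereas the stable but wrong hypothesis in `bad` is rejected by the test in (C).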

First, assume that $S$ is learning some target function $f$ in the limit. By assumption, there is some program for $f$, say $j$, such that, for all $t \in \mathbb{N}$, the report $(j,t,h)$ will be presented. Moreover, since the underlying experimentation $Expmt$ is insistent, we know that $h$ equals $S(\varphi_j[t])$. By Definition 1, $S$ is defined for all initial segments of $f$. Thus, (A) is always true. Since $S$, when learning $f$, performs at most finitely many mind changes, (B) is almost always true. Furthermore, $S$ converges to some correct $\varphi$-index for $f$, say $h'$, and therefore $h'$ is almost always the input for the test performed in (C). Since $\varphi_{h'} = f$, and therefore $\varphi_{h'} \in \mathcal{R}$, (C) is almost always true, too. Hence, $Eval(j,t,h)$ equals 1 for almost all time stamps $t$, and we are done.

In contrast, if $S$ fails on some function $f$, then one of the following three events must happen: (i) $S$ does not return a hypothesis for at least one test data input, (ii) $S$ does not converge, (iii) $S$ does converge, but to some final hypothesis $h''$ not correctly reflecting the target function $f$. In case (i) happens, the validation dialogue does not succeed because of (A). In case (ii) happens, (B) prevents $Eval$ from successfully converging to 1. Finally, if (iii) happens, it will be recognized that at least one of the following cases occurs: $\varphi_{h''}(x) \neq f(x)$ for some $x \in \mathbb{N}$, or $\varphi_{h''} \notin \mathcal{R}$. In either case, the validation dialogue will not converge. Hence, the claim follows.

Claim 2. $[LIM \in EVAL]^A \Rightarrow [I_{\mathcal{R}}$ l.r.e.$]^A$.

Fix any non-empty $U \in LIM$, any learning device $S$ with $U \subseteq LIM(S)$, any target function $g \in U$, and any $k \in \mathbb{N}$ with $\varphi_k = g$. For every $j \in \mathbb{N}$, let $\psi_j$ be the following function. For all $x \in \mathbb{N}$, it holds:

^ (undefined : otherwise 

Clearly, for all $j \in \mathbb{N}$, $\psi_j$ equals $g$ if and only if $\varphi_j \in \mathcal{R}$.

Furthermore, for all $j \in \mathbb{N}$, let $S_j$ be a learning device which is defined as follows. For all $f[n] \in \mathbb{N}$, we let:

All \) (^(/[n]) : otherwise 

Obviously, for all $j \in \mathbb{N}$, we have $LIM(S_j) = LIM(S)$, and thus $S_j$ learns $g$, if and only if $\psi_j$ equals $g$, if and only if $\varphi_j \in \mathcal{R}$.

Now, we are ready to show that the evaluation $Eval$ can be used to limiting enumerate $I_{\mathcal{R}}$. Recall that $\varphi_k = g$. For all $j,n \in \mathbb{N}$, we specify $d(j,n) = Eval(k,n,j)$.

We claim that $d$ limiting enumerates $I_{\mathcal{R}}$.

Suppose any $j \in \mathbb{N}$. We have to show that the subsequence $(Eval(k,n,j))_{n \in \mathbb{N}}$ converges to 1 if and only if $\varphi_j \in \mathcal{R}$. Since, by construction, $\varphi_j \in \mathcal{R}$ if and only if $S_j$ learns $g$, it suffices to show that $\lim (Eval(k,n,j))_{n \in \mathbb{N}} = 1$ if and only if $S_j$ learns $g$. This can be seen as follows.

First, suppose that $S_j$ learns $g$. Then, $LIM(S_j) = LIM(S)$. Now, since $Eval$ is witnessing $[LIM \in EVAL]$, by assumption, the subsequence $(Eval(k,n,j))_{n \in \mathbb{N}}$ has to converge to 1.

Second, suppose that $S_j$ does not learn $g$. Now, fix any validation dialogue $VD$ for $S$ and $U$. Clearly, this dialogue contains at most finitely many reports with the evaluation 0. Now, delete all reports $((k',t,h),b)$ with $\varphi_{k'} = g$. Obviously, the remaining dialogue $VD'$ also contains at most finitely many reports with the evaluation 0. Now, consider the dialogue $VD'' = VD' \cup \{((k,n,j), Eval(k,n,j)) \mid n \in \mathbb{N}\}$. (Note that one can easily fix some test data selection complete for $U$ and some experimentation insistent for $S_j$ that, together with $Eval$, result exactly in this particular validation dialogue.) Clearly, the dialogue $VD''$ must contain infinitely many reports with the evaluation 0, since $S_j$ fails to learn $g$. Obviously, this can only happen if there are infinitely many $n$ with $Eval(k,n,j) = 0$. This finishes the proof of the claim, and the theorem follows. □

Proposition 23 characterizes validation expertise. Next, we attack the general problem of relating validation expertise to domain expertise, i.e. the ability to solve learning problems in the required sense. The following result is due to Adleman and Blum (cf. [1]).

Proposition 24. Let $A$ be any oracle. Then, $[I_{\mathcal{R}}$ l.r.e.$]^A \iff [\mathcal{R} \in LIM]^A$.

Putting the last two results together, we arrive at the following insight.

Theorem 25. Let $A$ be any oracle. Then, $[LIM \in EVAL]^A \iff [\mathcal{R} \in LIM]^A$.

Consequently, an expert who has the ability to LIM-evaluate all learning devices has a level of expertise which is sufficient to solve every possible learning task, i.e. to learn in the limit every total recursive function from input/output examples.

The ideas underlying the proof of Proposition 23 apply mutatis mutandis to 
elaborate the following result. 

Proposition 26. Let $A$ be any oracle. Then,

(1) $[TOTAL \in EVAL]^A \iff [I_{\mathcal{R}}$ l.r.e.$]^A$.

(2) $[FIN \in EVAL]^A \iff [I_{\mathcal{R}}$ l.r.e.$]^A$.

Our final result in this subsection summarizes the insights obtained so far.

Theorem 27. Let $A$ be any oracle. Then, the following statements are equivalent: $[LIM \in EVAL]^A$, $[TOTAL \in EVAL]^A$, $[FIN \in EVAL]^A$, and $[\mathcal{R} \in LIM]^A$.

Although different types of learning behaviour may require different valida- 
tion approaches, the needed level of expertise turns out to be the same. 

Subsequently, we would like to direct the reader’s attention to the following 
problems: 

(A) Imagine one has to solve a particular learning problem. Characterize the level of expertise that is necessary and sufficient to figure out which learning devices are able to solve the learning problem at hand.

(B) Imagine someone is the provider of a particular learning system. Characterize the level of expertise that is necessary and sufficient to evaluate whether or not the system provided is able to solve the learning problems of some potential customer.

In investigating these problems, we confine ourselves to study LIM-type eval- 
uation, only. 

Note that most of the results may be easily adapted to handle the TOTAL- and FIN-case, as well.

Having a closer look at the demonstration of Proposition 23, one might easily 
recognize that the verification of Claim 2 is supporting the following considerably 
stronger result. 

Proposition 28. For any oracle $A$ and any $U \in LIM$, $[(LIM, \mathcal{P}, \{U\}) \in EVAL]^A$ implies $[I_{\mathcal{R}}$ l.r.e.$]^A$.

Concerning question (B), the situation is much more involved. As it turns out, the answer to this question heavily depends on the properties of the favoured learning device.

Proposition 29. Let $S \in \mathcal{P}$ such that $LIM(S) \in NUM!$. Then, there is some evaluation $Eval$ witnessing $(LIM, \{S\}, LIM) \in EVAL$.

As we have seen, there are learning devices that can be LIM-evaluated by an expert without any additional non-computable expertise. Interestingly, the opposite extreme can be observed, as well. In order to evaluate a particular learning device, expertise is needed that allows one to LIM-evaluate every learning strategy with respect to all learning problems.

Proposition 30. There is some learning device $S \in \mathcal{P}$ such that, for any oracle $A$, $[(LIM, \{S\}, LIM) \in EVAL]^A$ implies $[I_{\mathcal{R}}$ l.r.e.$]^A$.

4 Conclusions 

Let us very briefly sum up the technical contents of the present paper. We know about sufficient and necessary expertise to accomplish some validation tasks. Interestingly, this expertise cannot be automated. The strength of the expertise is illustrated by the following informal statement: Whoever is able to validate certain learning devices is also able to replace them in solving learning problems.

Evidence for this thesis is provided by several of our results above. There are some results illuminating the necessity to have an expertise formally expressed by the oracle $H$, i.e. by the power to decide the halting problem. In case this power is available, it is immediately possible to learn in the limit any recursive function: $[\mathcal{R} \in LIM]^H$.

However, these remarks refer only to the technical perspective of our present paper. Our starting point was more general.

The validation of complex systems is a remarkably urgent problem area. Several validation approaches and scenarios are currently under development, under theoretical investigation, and also under experimental exploration.

As soon as human experts become involved, the question of the experts' competence becomes crucial. Most problems, even some very fundamental ones, are still open. A quite typical question is how the validation experts' expertise relates to the domain experts' skills. Is it necessary that anybody involved in systems validation needs to be qualified for doing the system's job, at least in principle? Or does a substantially lower degree of qualification suffice for validating a system's behaviour?

There might be no generally valid answers to those questions. Despite this, any clear answer derived under certain more specific circumstances might be discussed controversially. Therefore, a firm justification is both theoretically and practically relevant. The present paper is intended to provide an example, only. There is evidence for the thesis that validation is not simpler than doing the job itself.



Abstract. This paper aims to extend well known results about polynomial update-time bounded learning strategies in a recursion theoretic setting.

We generalize the update-time complexity using the approach of Blum's computational complexity measures.

It will turn out that consistency is the first natural condition having a narrowing effect for arbitrary update boundaries. We show the existence of arbitrarily hard, as well as an infinite chain of harder and harder, consistently learnable sets. The complexity gap between consistent and inconsistent strategies solving the same problem can be arbitrarily large. We prove an exact characterization for polynomial consistent learnability, giving a deeper insight into the problem of hard consistent learnability.



1 Introduction 

Since Gold's definition of identification in the limit, there has been extensive research on the problem of what can be learned and what cannot. A huge number of learning classes were defined, applying more or less natural constraints to the basic LIM definition. The resulting classes as well as their relationships are well explored. On the other hand, very few results about the complexity of identification in the limit are known. There seem to be two types of complexity theoretic results: A) general approaches, which apply to sets of learning problems [DS86, FKS93, FKS95, JS95], and B) specific approaches, which only apply to specific problems in specific hypothesis spaces or for specific strategies [WL76, WZ94, Pit89, Ish90, Wat94, Kla94, FW90, Kin94, LZ95, HDGW94].

From computational complexity theory it is well known that a polynomial complexity bound naturally means a good or fast solution. All general approaches for learning complexities do not have this "natural" property. Their complexity upper bounds are at least polynomials. On the other hand, most of the specific approaches satisfy the above "natural" condition.

Pitt in [Pit89] reviews some definitions of identification in the limit and discusses the problem of augmenting the definitions in order to incorporate a notion of computational efficiency. He summarized his discussion: "finding a natural formal definition that captures the notion of (polynomial) efficient inference ... is not at all straightforward". Comparing his results for learning DFA with those of Lange, Wiehagen and Zeugmann [LW91, WZ92] (learning pattern languages) and Valiant (PAC learning), we found one common constraint for their polynomial learning strategies: consistency.

In this paper we examine polynomial limiting consistent identification in the 
Gold-style learning model of recursive functions. Coming from the above results 
and based on Blum’s axiomatic computational complexity theory, we explore 
the update complexity of consistent learning strategies. We will give different 
degrees of hard learnability and show the existence of an infinite complexity hi- 
erarchy. We will prove that the demand for consistency can require any amount of 
resources. Our main result is a characterization of polynomial consistent learnability. We will show the relationships between the polynomially learnable sets and different non-complexity-bounded learning classes. We will outline that the above results also apply to other learning conditions.



2 Preliminaries 

$\mathbb{N} = \{0,1,2,\ldots\}$ denotes the set of all natural numbers. The set of finite sequences of natural numbers is denoted by $\mathbb{N}^*$. The sets of all partial recursive and recursive functions of one and two arguments over $\mathbb{N}$ are denoted by $\mathcal{P}$, $\mathcal{P}^2$, $\mathcal{R}$, and $\mathcal{R}^2$, respectively. From time to time, we equate a recursive function with the sequence of its values. For arbitrary $f \in \mathcal{P}$ and $x \in \mathbb{N}$, we write $f(x)\downarrow$ to denote that $f(x)$ is defined. Let $f \in \mathcal{P}$ and $\alpha \in \mathbb{N}^*$; we write $\alpha \sqsubseteq f$ iff $\alpha$ is a prefix of the sequence of values associated with $f$. By $\alpha(x)$ we denote the $x$-th element in the sequence $\alpha$. Let $f, g \in \mathcal{R}$, $n \in \mathbb{N}$; we write $f =_n g$ iff $f(x) = g(x)$ for all $x < n$.

Any function $\psi \in \mathcal{P}^2$ is called a numbering, or a programming system. $\psi_i$ abbreviates $\lambda x.\psi(i,x)$. $\mathcal{P}_\psi$ is the set of all $\psi_i$. A numbering $\varphi \in \mathcal{P}^2$ is called a Gödel numbering iff $\mathcal{P}_\varphi = \mathcal{P}$ and for any numbering $\psi \in \mathcal{P}^2$, there is a $c \in \mathcal{R}$ such that $\psi_i = \varphi_{c(i)}$ for all $i \in \mathbb{N}$. The tuple $(\varphi, \Phi)$ is a Blum complexity measure [Blu67] iff $\varphi$ is a Gödel numbering, $\varphi_i(x)\downarrow \iff \Phi_i(x)\downarrow$, and $\Phi_i(x) = y$ is recursive in $i$, $x$, and $y$. For any $S \in \mathcal{P}$ we write $\Phi_S$ to denote $\Phi_i$ for an $i$ providing $\varphi_i = S$. Time and memory complexity will be called natural complexities. $Poly_\Phi$ is the set of all $\varphi_i$ such that $\Phi_i$ is a polynomial. We write Poly as a placeholder for any polynomial.

For any function $f \in \mathcal{R}$, $f$ is said to be unbounded monotone increasing iff $f(x) < f(x+1)$ for all $x \in \mathbb{N}$ and $\lim_{x \to \infty} f(x) = \infty$. Using a fixed 1-1 coding $d(\ldots)$ of $\mathbb{N}^*$ onto $\mathbb{N}$, we write $f^n$ instead of $d(f(0), \ldots, f(n))$; $f^{-1}$ is defined to be the empty sequence. We also write $f^n$ to abbreviate the sequence $(f(0), \ldots, f(n))$. The usage will be clear from the context. Proper subset, subset, superset, and proper superset will be denoted by $\subset$, $\subseteq$, $\supseteq$, and $\supset$, respectively. min, max, and length are functions for the minimum, the maximum and the length of a sequence. For the empty sequence, we define $max() = 0$, $min() = \infty$, and $length() = 0$. By $\forall$, $\exists$, $\forall^\infty$, and $\exists^\infty$ we denote the quantifiers for all, exists a, for all but finitely many, and exists infinitely many, respectively.



3 Definitions and Basic Properties

Following Gold and Solomonoff [Sol64], we define an identification in the limit process as follows: A strategy is presented with some total recursive function $f$ by feeding successively growing parts of its graph $f^0, f^1, \ldots, f^n$. At each time $n$, the strategy has to make a guess, $i$, as to the identity of $f$. The guess $i$ is interpreted as a program for $f$ in an underlying programming system $\psi$, called the hypothesis space. A strategy learns $f$ iff the hypotheses generated on $f^n$ converge to an $i$ satisfying $\psi_i = f$. A set of recursive functions $U$ is called learnable in the limit with respect to a strategy $S \in \mathcal{P}$ and a hypothesis space $\psi \in \mathcal{P}^2$ (write $U \in LIM_\psi(S)$) iff the strategy can learn all functions in the set. The set of all sets of recursive functions learnable in the limit will be denoted by LIM ($:= \{U \subseteq \mathcal{R} \mid \exists S \in \mathcal{P}, \exists \psi \in \mathcal{P}^2 : U \in LIM_\psi(S)\}$).

Learning takes place if, after reaching the (unknown) point of convergence, the strategy outputs the correct function inferred from the knowledge about a finite part of the infinite object. A main problem in LIM learning is that every temporary hypothesis can fool the user as much as it wants. Scientists tried to solve this problem by applying more or less natural constraints to the LIM definition, exploring the narrowing effect on learnability. The resulting learning classes (e.g. FIN, CONS, NUM, ...) as well as their relationships are all well explored. One of those "natural" learning classes is CONS. A set of recursive functions, $U$, is consistently learnable with respect to a strategy $S \in \mathcal{P}$ and a hypothesis space $\psi \in \mathcal{P}^2$ ($U \in CONS_\psi(S)$) iff a) $U \in LIM_\psi(S)$ and b) every hypothesis produced during the learning process of one of the functions in $U$ is consistent with the graph seen so far, formally $\forall f \in U\ \forall n \in \mathbb{N} : [f^n \sqsubseteq \psi_{S(f^n)}]$. Again, we define CONS to be the set of all consistently learnable sets of recursive functions.
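
As an illustration of the two definitions, the following Python sketch implements identification by enumeration over a small, explicitly listed hypothesis space. It is a toy under stated assumptions: the three hypotheses and the names `enumeration_learner`, `hyps`, and `target` are invented for the example, and a finite list replaces the numbering $\psi$. The learner always returns the least index that agrees with the data seen so far, so every guess is consistent by construction.

```python
def enumeration_learner(hypotheses):
    """Build a consistent learner S over an explicitly listed hypothesis space."""
    def S(prefix):
        # prefix plays the role of f^n = (f(0), ..., f(n))
        for i, h in enumerate(hypotheses):
            if all(h(x) == v for x, v in enumerate(prefix)):
                return i          # least index consistent with the prefix
        raise ValueError("target lies outside the hypothesis space")
    return S

# Hypothetical hypothesis space: index i stands for the function psi_i.
hyps = [lambda x: 0, lambda x: x % 2, lambda x: x * x]
S = enumeration_learner(hyps)

target = hyps[2]
guesses = [S([target(x) for x in range(n + 1)]) for n in range(5)]
```

On the target $x \mapsto x^2$ the guesses change twice and then stay at the correct index, which is the convergence behaviour required by LIM.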

Consistency has a practical aspect: the user of a consistent learning system can be sure that the hypothesis is of good quality, that is, it has a total recall of the presented examples. This seems to be a quite natural condition for learning strategies. At first glance, there seems to be no reason to produce inconsistent hypotheses. Nevertheless, Barzdin first announced that there are classes of recursive functions that can be learned in the limit but only by strategies working inconsistently. There are two characterizations known for consistent learnability, giving a deeper insight into this phenomenon.

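Theorem 1. $U \in CONS \iff$

$\exists \psi \in \mathcal{P}^2\ \exists g \in \mathcal{R}^3$ : (a) $U \subseteq \mathcal{P}_\psi$ and

(b) $\forall i,j,n \in \mathbb{N} : [g(i,j,n) = 1 \iff \psi_i =_n \psi_j]$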



Theorem 2. $U \in CONS \iff$

$\exists \psi \in \mathcal{P}^2$ : $U$-consistency is decidable in $\psi \iff$

$\exists \psi \in \mathcal{P}^2\ \exists g \in \mathcal{P}^2$ : (a) $U \subseteq \mathcal{P}_\psi$ and

(b) $\forall f \in U, \forall i,n \in \mathbb{N} : g(f^n, i)\downarrow$ and $[g(f^n, i) = 1 \iff f^n \sqsubseteq \psi_i]$

We already mentioned another phenomenon about consistent strategies: To the best of our knowledge, consistency is the only condition having a narrowing effect for polynomial update-time bounds.

The following will investigate the problem of defining resource bounded con- 
sistent learnability and finding a characterization for fast consistent learnability 
in Gold’s model of learning in the limit. 

The learning complexity measure we use is based on the amount of resources needed by a strategy to update a hypothesis between two successive inputs. This inference complexity is well known as update complexity. It was shown [DS86] that a polynomial update-space boundary does not affect the learning power of learning in the limit. On the other hand, Wiehagen and Zeugmann [WZ94] provided a learning problem such that, using a fixed hypothesis space, every consistent learner must exceed any polynomial time bound for its updates on at least one object to be learned, unless $P = NP$. Pitt proves a similar result for learning DFA: using the hypothesis space of DFAs, there is no consistent strategy learning DFAs fast.

We will generalize the above results in two ways: First, we allow the strategy 
to use arbitrary hypothesis spaces, and second, we use the general approach of 
Blum’s axiomatic complexity theory to define the update complexity. 

Let $(\varphi, \Phi)$ be an arbitrary Blum complexity measure, $\mathcal{I}$ a learning class (e.g. FIN, CONS, LIM, ...) and $p$ a recursive function. A set $U$ of recursive functions is said to be $\mathcal{I}$-learnable within a bound $p$ of $\Phi$-resources via strategy $S \in \mathcal{P}$ and programming system $\psi$ ($U \in p$-$\Phi$-$\mathcal{I}_\psi(S)$) iff a) $U$ is $\mathcal{I}$-learnable via $S$ and $\psi$, and b) for all $n$, $S$ fed with some $f^n \sqsubseteq f \in U$ consumes less than $p(f^n)$ $\Phi$-resources.

In most cases we will restrict ourselves to special natural Blum complexity measures, such as time and memory resources. Those have the nice property that the amount of resources used to run two programs successively can be estimated by the sum of the resources needed to run each program separately. Note that this is not true for arbitrary Blum complexity measures.

Definition 1 ($p$-$\Phi$-$\mathcal{I}$).

Let $p \in \mathcal{R}$ be a recursive function (called complexity bound), $(\varphi, \Phi)$ a Blum complexity measure, $\mathcal{I}$ a learning class from {FIN, CONS, LIM, ...}, and $U$ a set of recursive functions.

$U \in p$-$\Phi$-$\mathcal{I} \iff$

$\exists S = \varphi_i \in \mathcal{R}, \psi \in \mathcal{P}^2$ : (a) $U \in \mathcal{I}_\psi(S)$

(b) $\forall f \in U, n \in \mathbb{N} : [\Phi_i(f^n) < p(f^n)]$

To show that a set of functions is not $p$-$\Phi$-$\mathcal{I}$ learnable, it is sufficient to show that every $\mathcal{I}$-learner must exceed the $p$ bound for its update $\Phi$-complexity on at least one object to be learned. To prove much harder non-learnability results, we need the following definitions.

Definition 2 ($\Phi$-insufficient, $\Phi$-hard).

Let $p \in \mathcal{R}$ be a recursive function, $(\varphi, \Phi)$ a Blum complexity measure, $\mathcal{I}$ a learning class from {FIN, CONS, LIM, ...}, and $U$ a set of recursive functions.

$p$ is $\Phi$-insufficient for $U$ wrt $\mathcal{I}$, iff for all strategies $S = \varphi_i$ and all programming systems $\psi$:

$U \in \mathcal{I}_\psi(S) \Rightarrow \forall f \in U\ \exists^\infty n : [\Phi_i(f^n) > p(f^n)]$

$p$ is $\Phi$-hard for $U$ wrt $\mathcal{I}$, iff for all strategies $S = \varphi_i$ and all programming systems $\psi$:

$U \in \mathcal{I}_\psi(S) \Rightarrow \forall f \in U\ \forall^\infty n : [\Phi_i(f^n) > p(f^n)]$

The following corollary follows immediately:

Corollary 1. For arbitrary $(\varphi, \Phi)$, $\mathcal{I}$, $U$, $p$:

$p$ is $\Phi$-hard for $U$ wrt $\mathcal{I}$ $\Rightarrow$ $p$ is $\Phi$-insufficient for $U$ wrt $\mathcal{I}$ $\Rightarrow$ $U \notin p$-$\Phi$-$\mathcal{I}$

For natural Blum complexity measures $(\varphi, \Phi)$ it is well known that any unbounded monotone increasing update boundary does not reduce the learning power of learning in the limit:

Theorem 3. For any unbounded monotone increasing recursive function $p$:

$p$-$\Phi$-LIM = LIM.

The proof is based on a simulation of the strategy $S$ (which is known to do the learning task), consuming only resources within the $p$-complexity bound during this simulation. In the following we abbreviate this simulation with $Poly^p_S$, or simply with $Poly_S$ if $p$ is not important. Note that it is necessary to demand $p$ to be an unbounded monotone increasing function, since Daley and Smith [DS86] exhibited a hierarchy for LIM learning classes bounded by total limiting recursive functionals.
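
The idea of such a budgeted simulation can be sketched as follows. This is a simplified model and not the construction from the proof: the learner is assumed to report its own cost together with its hypothesis (a real simulation would interrupt the computation once the budget is used up), costs are assumed to add up as for natural measures, and the names `bounded_simulation` and `slow_learner` are hypothetical.

```python
def bounded_simulation(learner_with_cost, p):
    """Wrap a learner so that each update simulates at most p(n) cost units."""
    def poly_s(prefix):
        budget = p(len(prefix))
        guess = 0                       # default hypothesis before any simulation finishes
        for m in range(1, len(prefix) + 1):
            hypothesis, cost = learner_with_cost(prefix[:m])
            if cost > budget:           # next simulation step does not fit the budget
                break
            budget -= cost
            guess = hypothesis          # latest hypothesis obtained within the budget
        return guess
    return poly_s

def slow_learner(prefix):
    # Hypothetical learner: guesses the maximum value seen so far, at quadratic cost.
    return max(prefix), len(prefix) ** 2

fast = bounded_simulation(slow_learner, p=lambda n: n)
```

Because the budget $p(n)$ grows without bound, the wrapped learner eventually replays every guess of the original learner, only later; it therefore converges to the same hypothesis whenever the original learner converges.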

Also note that this proof cannot be taken over for arbitrary Blum complexity measures, since there is no possibility to estimate the complexity of the simulation task. Even the well known result that for any two complexity measures there is a two-valued recursive compiler cannot save Theorem 3 for non-natural Blum complexities.

Likewise, for most other learning classes, learning with few resources does not have a narrowing effect on the learning power of the class.

Corollary 2. Let $\mathcal{I} \in$ {FIN, LIM}:

Poly-$\Phi$-$\mathcal{I}$ = $\mathcal{I}$



4 Consistent Polynomial Learning 

While polynomial update bounds do not affect the learning power of most learning classes, this is not true for consistent learnability.

For any natural complexity measure $(\varphi, \Phi)$ and recursive unbounded monotone increasing function $p$, it is not hard to give an example of a class that can be learned consistently, but is not a member of $p$-$\Phi$-CONS.

Theorem 4. Let $p$ be an unbounded monotone increasing function:

$p$-$\Phi$-CONS $\subset$ CONS.

Proof. Choose an arbitrary $f \in \mathcal{R}_{0,1}$. Let $U_f = \{\alpha 0^\infty \mid \alpha \in \{0,1\}^*\} \cup \{f\}$.

Since $U_f \in$ CONS for all $f$, it remains to prove that there is an $f$ such that $U_f \notin p$-$\Phi$-CONS.

Suppose there is a strategy $S \in \mathcal{P}$ and a numbering $\psi \in \mathcal{P}^2$ such that $U_f \in CONS_\psi(S)$. Let $c$ be the point of convergence for strategy $S$ on input $f$, and let $S(f^c) = j$. Let $Table[f]$ be a list of the values of $f$ up to argument $c$. Now, we define a procedure computing $f$ as follows:

Procedure f: Input: $x$

1. if $x < c$ return $Table[f](x)$;

2. for $i = 0$ to $x - 1$ do compute $f(i)$;

3. Do (A) and (B) in parallel until (A) or (B) returns $j$:

(A) evaluate $S(f^{x-1}1)$;

(B) evaluate $S(f^{x-1}0)$;

4. if (A) evaluates to $j$ return 1;

5. if (B) evaluates to $j$ return 0;
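
Procedure f can be tried out on a toy instance. The sketch below makes several assumptions that are not part of the proof: the target `f` is the characteristic function of the perfect squares, the hypothesis space is an ad hoc enumeration (index 0 is `f`, index $k \ge 1$ is the $k$-th binary string followed by zeros), and `S` is the enumeration learner for this space. Since this `S` is total, the sketch evaluates only branch (A) instead of running (A) and (B) in parallel.

```python
import math
from itertools import count

def f(x):
    # Hypothetical target in R_{0,1}: 1 exactly on the perfect squares.
    return 1 if math.isqrt(x) ** 2 == x else 0

def hypothesis(k):
    """Index 0 is f itself; index k >= 1 is alpha_k followed by zeros."""
    if k == 0:
        return f
    alpha = [int(b) for b in bin(k)[3:]]      # k-th binary string
    return lambda x: alpha[x] if x < len(alpha) else 0

def S(prefix):
    """Consistent enumeration learner: least index agreeing with the prefix."""
    for k in count():
        h = hypothesis(k)
        if all(h(x) == v for x, v in enumerate(prefix)):
            return k

def recover(x, j, table):
    """Procedure f: table holds the values below the convergence point c."""
    values = list(table)
    while len(values) <= x:
        # By consistency, S maps exactly one of the two extensions to j.
        values.append(1 if S(values + [1]) == j else 0)
    return values[x]
```

Here `S` outputs index 0 on every prefix of `f`, so the convergence point is reached immediately and `recover(x, 0, [])` reproduces `f(x)` using nothing but calls to the learner. This is the reason why a cheap consistent learner would yield a cheap program for `f`.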

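It is easy to see that procedure f really computes the function $f$. The amount of resources needed for evaluating the values of $f$ for arguments 1 to $c - 1$ is a constant, say $g$. Since $\Phi$ is natural we can estimate the amount of resources for all values $x > c$:

$\Phi_f(x) \le g + \sum_{i=c}^{x} 2\Phi_S(f^i)$

Since $f$ is chosen arbitrarily, we can assume $\Phi_f(x) > g + \sum_{i=c}^{x} 2p(f^i)$ for infinitely many $x$.

$\Rightarrow \exists^\infty x : g + \sum_{i=c}^{x} 2p(f^i) < \Phi_f(x) \le g + \sum_{i=c}^{x} 2\Phi_S(f^i)$

$\Rightarrow \exists^\infty x : \sum_{i=c}^{x} p(f^i) < \sum_{i=c}^{x} \Phi_S(f^i)$

$\Rightarrow \exists^\infty x : p(f^x) < \Phi_S(f^x)$

$\Rightarrow U_f \notin p$-$\Phi$-$CONS_\psi(S)$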



Repeating the arguments given for Theorem 3, we can point out that the above technique used to prove Theorem 4 cannot be used for arbitrary Blum complexity measures.

A simple modification of Theorem 4 can be used to prove the existence of arbitrarily $\Phi$-insufficient, as well as arbitrarily $\Phi$-hard, consistently learnable sets.

Corollary 3. For any unbounded increasing recursive function $p$, there is a $U \subseteq \mathcal{R}$ such that $p$ is $\Phi$-insufficient for $U$ wrt CONS.

Proof. Choose $f \in \mathcal{R}_{0,1}$ such that $\Phi_f(x) > g + \sum_{i=c}^{x} 2p(f^i)$ for infinitely many $x$. Set $U = \{g \in \mathcal{R}_{0,1} \mid \forall^\infty x : f(x) = g(x)\}$. The corollary follows immediately from Theorem 4. □



Corollary 4. For any unbounded increasing recursive function $p$, there is a $U \subseteq \mathcal{R}$ such that $p$ is $\Phi$-hard for $U$ wrt CONS.



Proof. Suppose $f \in \mathcal{R}_{0,1}$ is a function such that for all but finitely many $x$, $\Phi_f(x) > q(f^x)$, for some $q \in \mathcal{R}$. Let $U = \{g \in \mathcal{R}_{0,1} \mid \forall^\infty x : f(x) = g(x)\}$. Using the simulation technique of Theorem 4 we can show that for any strategy $S$ and any programming system $\psi$, if $U \in CONS_\psi(S)$, then for all but finitely many $x$,

^s(f^) > Since for any p G TZ, there is a corresponding q G TZ, 



such that for all n > 1: p(n) < 



g(r)-9(r~ -1) 



, the corollary follows immediately. 



Now we are ready to prove an infinite chain for resource bounded consistent 
learning classes. 

Theorem 5.

There is an infinite chain of recursive functions $(p_i)_{i \in \mathbb{N}}$ such that for all $i$:

$p_i$-$\Phi$-CONS $\subset$ $p_{i+1}$-$\Phi$-CONS

Proof. Start with some recursive function $p_0$. Use Theorem 4 to prove the existence of a set $U_0$ such that $U_0 \in$ CONS $\setminus$ $p_0$-$\Phi$-CONS. Since $U_0 \in$ CONS, there is a strategy $S_0$ (for example the enumeration strategy) learning $U_0$. Choosing $p_1 = \Phi_{S_0}$, it is clear that $p_0$-$\Phi$-CONS $\subset$ $p_1$-$\Phi$-CONS. Repeat the above construction using $p_1$ to find $p_2$ and so on. □

Corollary 5.

Poly-$\Phi$-CONS $\subset$ CONS

Up to now, the proofs of all theorems and corollaries are based on a simple simulation technique to estimate the resources needed to compute a function. We already mentioned that this technique is not available for arbitrary Blum complexities. Nevertheless, we are able to prove counterparts of the above theorems and corollaries even if we do not restrict ourselves to natural Blum complexities.

The proof is an extension of Blum's computational theoretic counterpart [Blu67]: Given any total recursive function $p$ and any Blum complexity measure $(\varphi, \Phi)$, there is a subset $M$ of the natural numbers whose characteristic function has at least a fixed $\Phi$-complexity. But while Blum could use a simple diagonalization technique to prove the existence of arbitrarily hard functions, we need something like a limiting diagonalization technique, that is, the diagonalization is also a limiting process.



Consistent Polynomial Identification in the Limit 






In the following we use any Blum complexity measure $(\varphi, \Phi)$ and recursive function $p$. We will construct an infinite set of functions for which every consistent strategy using any hypothesis space, and needing everywhere few resources, fails infinitely many times to converge to a (correct) hypothesis, for at least one function to be learned. First, we will show how the general limiting diagonalization process works for a fixed strategy. Note that it is useful to demand the strategy to be consistent on all possible sequences of natural numbers. This forces the strategy to be a recursive function and to produce different hypotheses for different sequences.

To abbreviate the notation, we define a predicate $F_j^t(f^n)$ to hold iff $t \in \{0,1\}$ and strategy $\varphi_j$ on input $f^n t$ consumes not more than $p(f^n t)$ $\Phi$-resources. $F_j(f^n)$ holds iff $F_j^t(f^n)$ holds, either for $t = 0$ or $t = 1$.

Now, fix an arbitrary strategy $S = \varphi_j$. Suppose function $f$ is defined up to argument $n$. The value for the next argument $n+1$ will be either 1 or 0. If $F_j(f^n)$ does not hold, that is, if we cannot extend $f^n$ such that strategy $S$ produces a hypothesis within the given complexity bounds, the value of $f(n+1)$ can be set to whatever we want, say 1. On the other hand, assume wlog that $F_j^1(f^n)$ holds and the strategy is able to produce a hypothesis, say $S(f^n 1) = h'$. The first time (and all odd times, if $h$ equals $*$) this happens, we allow the strategy to produce this hypothesis, setting $f(n+1) = 1$, but we save the hypothesis in a global variable $h$. The second time (and all even times, if $h$ is a natural number) this happens, we remember the last hypothesis ($h$) and compare it with the new hypothesis ($h'$). If $h' \neq h$, we simply extend $f^n$ by 1. In the other case, if $h' = h$, we set $f(n+1) = 0$, forcing the strategy, which is consistent on both sequences $f^n 0$ and $f^n 1$, to produce a hypothesis different from $h'$. In both cases the strategy must change the hypothesis.

The above construction for $f(n+1)$ is well defined for arbitrary strategies. Moreover, assuming $S$ to be a strategy which is consistent on arbitrary sequences and uses few resources on almost all parts of the graph of $f$ leads to non-learnability of $f$ for strategy $S$. The formal definition of procedure $g(f^n, j, h)$ ($:= f(n+1)$) is as follows:



Procedure g:

0. Case: $F_j(f^n)$ does not hold: set $f(n+1) = 1$.

1. Case: $h$ is $*$ (the initial value of $h$):

   a. if $F_j^1(f^n)$ holds then set $f(n+1) = 1$ else set $f(n+1) = 0$.

   b. set $h$ to be $\varphi_j(f^{n+1})$.

2. Case: the value of $h$ is a natural number:

   a. if $F_j^1(f^n)$ holds

      i. if $\varphi_j(f^n 1) = h$ then set $f(n+1) = 0$.

      ii. if $\varphi_j(f^n 1) \neq h$ then set $f(n+1) = 1$.

   b. else if $F_j^0(f^n)$ holds

      i. if $\varphi_j(f^n 0) = h$ then set $f(n+1) = 1$.

      ii. if $\varphi_j(f^n 0) \neq h$ then set $f(n+1) = 0$.

   c. set $h$ to $*$.

3. return $f(n+1)$.

Now we are ready to prove the next theorem.
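Theorem 6. For all Blum complexity measures $(\varphi, \Phi)$ and recursive functions $p$:

$p$-$\Phi$-CONS $\subset$ CONS

Proof. It is sufficient to show $\exists U \in$ CONS:

$\forall S \in \mathcal{P}, \psi \in \mathcal{P}^2\ \exists f \in U, \exists^\infty n : [U \in CONS_\psi(S) \Rightarrow \Phi_S(f^n) > p(f^n)]$

For each $i \in \mathbb{N}$ define $f_i$ as follows: if $\alpha \in \mathbb{N}^*$ satisfies $d(\alpha) = i$, then initialize $f_i$ with $\alpha$ and continue to define $f_i$ using procedure g. Let $U = \{f_i \mid i \in \mathbb{N}\}$. $U$ is enumerable, hence in CONS, but no resource bounded learner which is consistent on $U$ (hence on $\mathcal{R}$) can learn $f_i$. □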
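
Procedure g, which drives this proof, can be sketched in Python. The sketch rests on assumptions: the resource predicate $F_j^t$ is passed in as a hypothetical callable `affordable`, the mark $*$ is represented by `None`, case 2b is read symmetrically to case 2a (the strategy is queried on the extension by 0), and `toy_strategy` is an invented learner that guesses index 0 as long as it has seen only ones.

```python
STAR = None  # stands for the initial value * of h

def g(prefix, strategy, affordable, h):
    """Return (f(n+1), new h) for the prefix f^n, following cases 0 to 2."""
    ok1, ok0 = affordable(prefix + [1]), affordable(prefix + [0])
    if not (ok1 or ok0):                 # case 0: F_j(f^n) does not hold
        return 1, h
    t = 1 if ok1 else 0                  # affordable extension, preferring 1
    guess = strategy(prefix + [t])
    if h is STAR:                        # case 1: let the strategy commit, remember its guess
        return t, guess
    # case 2: a repeated guess is refuted by choosing the other value
    return (1 - t if guess == h else t), STAR

def toy_strategy(prefix):
    # Hypothetical learner: index 0 means "constantly 1", otherwise a table index.
    if all(v == 1 for v in prefix):
        return 0
    return 1 + int("1" + "".join(map(str, prefix)), 2)

def build(steps, strategy, affordable):
    """Iterate g to define the first values of the diagonal function."""
    prefix, h = [], STAR
    for _ in range(steps):
        value, h = g(prefix, strategy, affordable, h)
        prefix.append(value)
    return prefix
```

With an always satisfied resource predicate, the toy learner first guesses "constantly 1", repeats that guess, and is refuted by the 0 at the second position; if no extension is ever affordable, the function is simply filled with ones.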

We can even show the existence of a single recursive function such that all strategies learning this function and being consistent on all initial sequences of recursive functions¹ must waste resources for infinitely many inputs from $f$.

Theorem 7. For all Blum complexity measures $(\varphi, \Phi)$ and recursive functions $p$, there is a single recursive function $f$:

$\{f\} \notin p$-$\Phi$-$\mathcal{R}$-CONS



Proof. It is sufficient to show:

$\exists f \in \mathcal{R}, \exists^\infty n, \forall S\ [\{f\} \in \mathcal{R}$-$CONS_\psi(S) \Rightarrow \Phi_S(f^n) > p(f^n)]$

The difficulty is to do the diagonalization used in Theorem 6 for all strategies within the definition of a single function. To do so, we must give any strategy which seems to learn the function using few resources the chance to define the next value of $f$ infinitely often.

We give all natural numbers two marks: a priority mark ($pmark(x)$) and a hypothesis mark ($hmark(x)$), all initially set to $*$. $pmark(x)$ is the priority of strategy $\varphi_x$ ($*$ means unused up till now) to be the next candidate to define the next value of $f$. $hmark$ is used to remember the hypothesis used in the diagonalization process.

Suppose $f$ is defined until argument $n$. The next value $f(n+1)$ is defined using procedure $g'(f^n)$.

Procedure g': 

0. Set hmark(n) := * and set pmark(n) to a priority lower than all priorities 
used up to now. 

1. Choose j with highest priority, such that Fj(/") holds. If no such j exists, 
set j to 0. 

2. Set h := hmark{j). 

^ Those strategies are called 7?.-consistent. The corresponding learning class is written 
7^CONS. 
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3. Set f{n+ 1) = g{f^,j,h) (taken from Theorem|H|) 

4. Give pmark{j) a priority lower than all priorities used up till now. 

5. Set hmark(j) to the new value of h coming from procedure g. 

6. Return f{n + 1). 

$pmark$ implements a fair selection rule such that each strategy $\varphi_j$ satisfying
$F_j(f^n)$ infinitely often will be selected infinitely often. Thus, $\{f\}$ is $\mathcal{R}$-consistently
learnable, but only via strategies wasting too many resources. □
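The fair selection rule implemented by the priority marks can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the predicate `eligible(j, n)` is a hypothetical stand-in for the condition tested in step 1, priorities are modelled by the position in a list, and the diagonalization procedure $g$ itself is not modelled at all.

```python
from typing import Callable, List


def fair_selection(steps: int, eligible: Callable[[int, int], bool]) -> List[int]:
    """Return the index j chosen at each step n = 0, ..., steps - 1.

    `order` lists indices from highest to lowest priority.
    `eligible(j, n)` is a hypothetical stand-in for the test in step 1.
    """
    order: List[int] = []
    chosen: List[int] = []
    for n in range(steps):
        # Step 0: index n enters with a priority lower than all used so far.
        order.append(n)
        # Step 1: highest-priority eligible index, or 0 if there is none.
        j = next((k for k in order if eligible(k, n)), 0)
        chosen.append(j)
        # Step 4: the selected index drops to the lowest priority.
        order.remove(j)
        order.append(j)
    return chosen
```

For example, if exactly the even indices are always eligible, `fair_selection(7, lambda j, n: j % 2 == 0)` returns `[0, 0, 0, 2, 0, 2, 4]`: every eligible index is eventually served, and an index that has just been served must wait behind all other eligible ones.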

The next theorem proves the existence of arbitrarily insufficient consistent
learning classes, that is:

There is a set of functions such that no consistent learner can learn even a
single function from this set using few resources. In other words: for all functions
$p$, there exists a $U \in \mathrm{CONS}$ such that $p$ is $\Phi$-insufficient for consistent $U$ learning.

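Theorem 8. For all $p \in \mathcal{R}$, there exists a $U \in \mathrm{CONS}$ such that:

$\forall S \in \mathcal{P}, \psi \in \mathcal{P}^2 : [U \in \mathrm{CONS}_{\psi}(S) \Rightarrow \forall f \in U,\ \exists^{\infty} n : \Phi_S(f^n) > p(f^n)]$

Proof. For each $i \in \mathbb{N}$, if $\alpha \in \mathbb{N}^*$ satisfies $d(\alpha) = i$, we initialize $f_i$ with $\alpha$ and
continue to define $f_i$ using the techniques in Lemma [?]. The set $U = \{f_i \mid i \in \mathbb{N}\}$
proves the theorem. □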

Coding a self-reference in each function allows us to show that the speed-up
between consistent and "intelligent"^ inconsistent learners can be arbitrarily
large.

Corollary 6. There is a function $q \in \mathcal{R}$ such that for all functions $p \in \mathcal{R}$, a
$U \subseteq \mathcal{R}$ can be found such that:

(a) $U \in \mathrm{CONS}$ (even $U \in \mathrm{NUM}$) and

(b) $U \in q$-$\Phi$-LIM and

(c) $U \notin (p \circ q)$-$\Phi$-CONS.

Proof. Let $S'(f^n) = f(\max(\{0\} \cup \{x < n \mid f(x) > 1\}))$ and set $q = \Phi_{S'}$. Take $U$
from Theorem [?] using $\lambda x : p(x) + q(x)$ as the monotone increasing function. The
conditions follow immediately from Theorem [?]. □
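As a rough illustration of the strategy $S'$ used in this proof, the following Python sketch reads the hypothesis off the function values themselves. It assumes that the prefix $f^n$ is given as the list $(f(0), \ldots, f(n))$ and that the bound is the strict $x < n$; the name `s_prime` is chosen here for illustration only.

```python
from typing import Sequence


def s_prime(values: Sequence[int]) -> int:
    """values = (f(0), ..., f(n)); return f(max({0} ∪ {x < n | f(x) > 1}))."""
    n = len(values) - 1
    # Positions below n whose value exceeds 1 carry the coded hypothesis.
    marked = [x for x in range(n) if values[x] > 1]
    return values[max([0] + marked)]
```

On the prefix `[7, 0, 1, 5, 0]` the last marked position below $n = 4$ is 3, so the sketch outputs 5. The strategy need not be consistent: it simply trusts the most recent value greater than 1.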

Now let us return to an arbitrary fixed natural complexity measure
and let us try to find a characterization for polynomial consistent identification
in the limit. This will give us a deeper insight into the nature of hard learnability.
Until now we always used a computationally hard function in a set, so that
consistent learnability also becomes hard. But using $U = \{f \mid \varphi_{f(0)} = f \wedge f \notin \mathit{Poly}_{\Phi}\}$
and $V = \mathit{Poly}_{\Phi}$, we can prove that computational hardness of the functions to
be learned need not lead to hard learnability of the functions ($U \in$ Poly-CONS)
and vice versa ($V \notin$ Poly-CONS).



^ "Intelligent" means that we do not use the Poly-strategy of Theorem [?], which only
slows down the normal learning process.
Since we already know two characterizations for consistent learnability (Theorems
[?] and [?]), one could try to modify them for polynomial consistent learnability.
We will see that only one of them (Theorem [?]) is a good choice to give
a new characterization. The proof of Theorem [?] is the same as it is for the
non-complexity counterpart.

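Theorem 9. $U \in$ Poly-$\Phi$-CONS $\iff$

$\exists \psi \in \mathcal{P}^2$ : $U$-consistency is $\Phi$-polynomial in $\psi$ decidable $\iff$

$\exists \psi, g \in \mathcal{P}^2,\ p \in \mathit{Poly}_{\Phi}$ : (a) $U \subseteq \mathcal{P}_{\psi}$

(b) $\forall f \in U, \forall i,n \in \mathbb{N} : \Phi_g(f^n, i) \le p(f^n, i) \wedge [g(f^n, i) = 1 \iff f^n \subseteq \psi_i]$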

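Nevertheless, the other characterization (Theorem [?]) gives us a condition
that every polynomially consistently learnable class must satisfy.

Theorem 10. $U \in$ Poly-$\Phi$-CONS $\Rightarrow$

$\exists \psi \in \mathcal{P}^2, \exists g \in \mathit{Poly}_{\Phi}$ : (a) $U \subseteq \mathcal{P}_{\psi}$

(b) $\forall i,j,n \in \mathbb{N} : [g(i,j,n) = 1 \iff \psi_i =_n \psi_j]$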
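To make condition (b) concrete, here is a minimal Python sketch of such an equality test on a toy numbering. It assumes that $\psi_i =_n \psi_j$ means agreement on all arguments $0, \ldots, n$ and that every function in the numbering is total; the names `equal_up_to` and `g_test` are illustrative. The sketch ignores resource bounds, which are the actual point of the theorem.

```python
from typing import Callable, Sequence

Func = Callable[[int], int]


def equal_up_to(f: Func, h: Func, n: int) -> bool:
    """True iff f and h agree on all arguments 0, ..., n (total functions assumed)."""
    return all(f(x) == h(x) for x in range(n + 1))


def g_test(i: int, j: int, n: int, numbering: Sequence[Func]) -> int:
    """Return 1 iff the i-th and j-th functions of the numbering agree up to n."""
    return int(equal_up_to(numbering[i], numbering[j], n))
```

With a numbering that may contain partial functions, this direct evaluation could diverge, which is why the existence of a polynomially bounded test is a genuine restriction on the numbering.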

Proof. Choose $U \subseteq \mathcal{R}$, $p \in \mathit{Poly}$, $\psi \in \mathcal{P}^2$, and $S \in \mathcal{P}$ such that $U \in p$-$\Phi$-$\mathrm{CONS}_{\psi}(S)$.
To prove Theorem [?] we need some auxiliary functions:

$ct$ : $ct(\alpha, S) = 1 \iff \alpha \subseteq \psi_{S(\alpha)}$

For arbitrary $\alpha \in \mathbb{N}^*$ and strategies $S \in \mathcal{P}$, $ct$ computes the hypothesis $S(\alpha)$
and, if defined, tests consistency. $ct$ is undefined if either $S(\alpha)$ or $\psi_{S(\alpha)}(x)$ is
undefined for some $x < length(\alpha)$.

$\lambda$ : $\lambda(n, \alpha, S) = \max\{k \mid \sum_{i=0}^{k} \Phi_{ct}(\alpha^i, S) \le n\}$

$\lambda$ computes for $n$ steps the results of $S(\alpha^0), S(\alpha^1), \ldots, S(\alpha)$, and outputs the
maximal length of $\alpha$ for which $S$ computes a consistent hypothesis. Note
that $\lambda(n, \alpha, S)$ does not need more than $n$ $\Phi$-resources.

$conv$ : $conv(\alpha, S) = \min\{n \mid S(\alpha^n) = S(\alpha^{n+1}) = \ldots = S(\alpha) = i\}$

$conv$ is the (temporary) point of convergence for strategy $S$ on input $\alpha$.

$conv^*$ : $conv^*(n, \alpha, S) = conv(\alpha^{\lambda(n,\alpha,S)}, S)$

$conv^*$ searches the (temporary) point of convergence for strategy $S$ on inputs
$\alpha^0, \alpha^1, \ldots, \alpha^{\lambda(n,\alpha,S)}$. It is easy to see that not much more than $n$ $\Phi$-resources
are needed to do so.
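The auxiliary functions can be sketched in Python under simplifying assumptions. All hypotheses are total functions held in a list `psi`, a prefix $\alpha^k$ is the tuple $(\alpha(0), \ldots, \alpha(k))$, and the hypothetical parameter `cost` stands in for the Blum measure of the consistency test. The names `lam` and `conv_star`, and the value returned when the budget does not even cover $\alpha^0$, are choices made here and not taken from the text.

```python
from typing import Callable, Sequence, Tuple

Prefix = Tuple[int, ...]
Strategy = Callable[[Prefix], int]           # prefix -> hypothesis index
Cost = Callable[[Prefix, Strategy], int]     # stand-in for the resource measure


def ct(alpha: Prefix, S: Strategy, psi: Sequence[Callable[[int], int]]) -> int:
    """1 iff the hypothesis S(alpha) agrees with alpha on every argument seen."""
    h = psi[S(alpha)]
    return int(all(h(x) == y for x, y in enumerate(alpha)))


def lam(n: int, alpha: Prefix, S: Strategy, cost: Cost) -> int:
    """Largest k such that cost(alpha^0) + ... + cost(alpha^k) <= n, else -1."""
    total, best = 0, -1
    for k in range(len(alpha)):
        total += cost(alpha[:k + 1], S)
        if total > n:
            break
        best = k
    return best


def conv(alpha: Prefix, S: Strategy) -> int:
    """Least m with S(alpha^m) = S(alpha^(m+1)) = ... = S(alpha); alpha non-empty."""
    final = S(alpha)
    m = len(alpha) - 1
    while m > 0 and S(alpha[:m]) == final:   # alpha[:m] is alpha^(m-1)
        m -= 1
    return m


def conv_star(n: int, alpha: Prefix, S: Strategy, cost: Cost) -> int:
    """Point of convergence on the part of alpha affordable within budget n."""
    k = lam(n, alpha, S, cost)
    if k < 0:
        return 0   # assumption: nothing affordable, so no prefix is fixed
    return conv(alpha[:k + 1], S)
```

The budget only limits how much of $\alpha$ is inspected, so a larger budget can move the reported point of convergence, which is why the text calls it temporary.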

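Now we are ready to define a new numbering $\psi'$. For arbitrary $j$ choose $\alpha \in
\mathbb{N}^*$, $i, n \in \mathbb{N}$ such that $j = d(\alpha, d(i, n))$, and define:

$\psi'_j(x) = \begin{cases} \alpha(x) & \text{if } x < conv^*(n, \alpha, S) \\ \psi_i(x) & \text{if } x \ge conv^*(n, \alpha, S) \wedge i = S(\psi'^{\,x-1}_j) = S(\psi_i^x) \\ \uparrow & \text{otherwise} \end{cases}$

$\psi' \in \mathcal{P}^2$ and $U \subseteq \mathcal{P}_{\psi'}$ are easy to see, if we know that for all $x < conv^*(n, \alpha, S)$,
$\alpha^x \subseteq \psi'_j$ holds, and $\psi'_j \in \mathcal{R}$ if and only if $\psi'_j = \psi_i$.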
We define the test $g$ as follows: For arbitrary $i', j', k$ choose the other variables
such that $i' = d(\alpha, i, n)$, $j' = d(\beta, j, m)$, $l = conv^*(n, \alpha, S)$, $l' = conv^*(m, \beta, S)$.
Wlog assume $l \le l'$.

$g(i', j', k) = 1 \iff \begin{cases} k \le l \wedge \alpha^k = \beta^k \\ \text{or } l < k \le l' \wedge \alpha^l \subseteq \beta \wedge S(\beta^l) = \ldots = S(\beta^k) = i \\ \text{or } l' < k \wedge \alpha^l \subseteq \beta \wedge S(\beta^l) = \ldots = S(\beta^{l'}) = i \end{cases}$

Note that using $k = l'$, we can even test equivalence for all functions in $\psi'$.

Remember the notes we made while defining $ct$, $\lambda$ and $conv^*$ to see that $g$
is in $\mathit{Poly}_{\Phi}$. Moreover, we can give an upper bound for $\Phi_g$ without taking the
third argument into account.

The last thing is to prove: $g(i', j', k) = 1 \iff \psi'_{i'} =_k \psi'_{j'}$. It is useful to
name the first, second and third condition in the definition of $g$ (*), (**), and
(***) respectively. We distinguish several cases: First we assume $g(i', j', k) = 1$.

— Suppose (*) holds. Then, by definition of $\psi'$, $\psi'_{i'} =_k \psi'_{j'}$.

— Suppose (**) holds. Then for all $x$ such that $l < x \le k$ : $S(\beta^x) = i$ and
$\beta^x \subseteq \psi_i$. But then $\beta^k \subseteq \psi'_{i'}$ must hold, and therefore $\psi'_{i'} =_k \psi'_{j'}$.

— Suppose (***) holds. Then $S(\alpha^l) = i = S(\beta^l) = \ldots = S(\beta^{l'}) = j$ and
$\beta^{l'} \subseteq \psi_j$. With a look at the definition of $\psi'$ we can see that in this case
$\psi'_{i'} = \psi'_{j'}$, so $\psi'_{i'} =_k \psi'_{j'}$ also holds.

For the rest we assume $g(i', j', k) = 0$.

— Suppose $k \le l$ and $\alpha^k \ne \beta^k$. Then by definition of $\psi'$, $\psi'_{i'} \ne_k \psi'_{j'}$.

— Suppose $l < k$ and $\alpha^l \not\subseteq \beta$. Then again by definition of $\psi'$, $\psi'_{i'} \ne_k \psi'_{j'}$.

— Suppose $l < k \le l'$ and $\alpha^l \subseteq \beta$ and there is an $x$ between $l$ and $k$ such that
$S(\beta^x) \ne i$. Choose $x$ to be minimal; then $\psi'_{i'} =_{x-1} \psi'_{j'}$. If $y = \beta(x)$, then
$\psi'_{i'}(x) \ne y$, since $S(\beta^x) = S(\psi'^{\,x-1}_{i'} y) \ne i$. It follows that $\psi'_{i'} \ne_k \psi'_{j'}$.

— The case when $l' < k$, $\alpha^l \subseteq \beta$ and there is an $x$ between $l$ and $l'$ such that
$S(\beta^x) \ne i$ is similar to the last case.



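The above list of cases is complete and proves $g(i', j', k) = 1 \iff \psi'_{i'} =_k \psi'_{j'}$. □

Note that Theorem [?] gives only an implication, not an equivalence, that is:

Theorem 11. There are $U \in \mathrm{NUM} \setminus$ Poly-$\Phi$-CONS, $\psi \in \mathcal{P}^2$, and $g \in \mathit{Poly}_{\Phi}$
such that $U \subseteq \mathcal{P}_{\psi}$ and for all $i, j, n \in \mathbb{N}$: $g(i, j, n) = 1 \iff \psi_i =_n \psi_j$.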

Proof. Choose an arbitrary $f \notin \mathit{Poly}_{\Phi}$. Let $U = \{h \mid \exists \alpha \in \mathbb{N}^* : [h = \alpha 0^{\infty}]\}$.
We already proved that $U \notin$ Poly-$\Phi$-CONS. It is left to show the existence of a
numbering $\psi$ such that for all $i, j, n \in \mathbb{N}$, $\psi_i =_n \psi_j$ is decidable. For any $m \in \mathbb{N}$
and $\alpha \in \mathbb{N}^*$ compute $f$ until $m$ $\Phi$-resources are used. Suppose we can compute
$f^k$. Now, if $f^k \subseteq \alpha 0^{\infty}$, define $\psi_{d(\alpha,m)} = f$. In the other case $\psi_{d(\alpha,m)}$ is set to
$\alpha 0^{\infty}$. In other words: $\psi_{d(\alpha,m)}$ equals $\alpha 0^{\infty}$ if and only if, after computing the
graph of $f$ for less than $m$ $\Phi$-resources, a difference between $f$ and $\alpha 0^{\infty}$ can be
found.
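The definition of this numbering can be sketched in Python as follows. The sketch assumes a total function `f` and a hypothetical per-argument cost function `cost` (with positive values) standing in for the $\Phi$-resources needed to compute $f(x)$; the coding $d(\alpha, m)$ is left out and the pair is passed directly.

```python
from typing import Callable, Tuple

Func = Callable[[int], int]


def psi(alpha: Tuple[int, ...], m: int, f: Func, cost: Callable[[int], int]) -> Func:
    """Return the function with index (alpha, m) in the sketched numbering."""
    def padded(x: int) -> int:
        # alpha followed by infinitely many zeros
        return alpha[x] if x < len(alpha) else 0

    used, x = 0, 0
    # Compute f(0), f(1), ... while strictly less than m resources are used.
    while used + cost(x) < m:
        used += cost(x)
        if f(x) != padded(x):
            return padded    # a difference was found within the budget
        x += 1
    return f                 # no difference found within the budget
```

With budget 0 nothing is computed and the index denotes `f`, while for every fixed `alpha` a sufficiently large budget exposes a difference, so the zero-padded `alpha` also receives an index.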

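We have to show two things: first, $U \subseteq \mathcal{P}_{\psi}$, and second, that $\psi_i =_n \psi_j$ is polynomially
decidable.

First note that for all $\alpha \in \mathbb{N}^*$, $f \ne \alpha 0^{\infty}$ holds. Now fixing $\alpha \in \mathbb{N}^*$, we can
find $h \in \mathbb{N}$ such that $f^h \not\subseteq \alpha 0^{\infty}$. Suppose $f^h$ can be computed with less than $m$
resources; then $\psi_{d(\alpha,m)} = \alpha 0^{\infty}$. On the other hand we can easily verify that
$\psi_{d(\alpha,0)} = f$. An easy way to decide $\psi_i =_n \psi_j$ for arbitrary $i, j, n \in \mathbb{N}$ is the following:
Let $i = d(\alpha, m)$ and $j = d(\beta, k)$. First try to find out whether $\psi_i = f$ or $\psi_i = \alpha 0^{\infty}$
holds, and do the same for $\psi_j$. This can be done polynomially in $i$ and $j$.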

— If both $\psi_i$ and $\psi_j$ equal $f$, then of course $\psi_i =_n \psi_j$ for all $n$.

— If $\psi_i = \alpha 0^{\infty}$ and $\psi_j = \beta 0^{\infty}$, then $\psi_i =_n \psi_j$ can be tested via $\alpha$ and $\beta$.

— If (wlog) $\psi_i = f$ and $\psi_j = \beta 0^{\infty}$, then by definition of $\psi_j$ we know that we
can find a difference between $f$ and $\beta 0^{\infty}$ using less than $k$ $\Phi$-resources. Say
$h$ is the smallest number such that $f^h \not\subseteq \beta 0^{\infty}$. Then $\psi_i =_n \psi_j \iff n < h$.

All cases can be verified polynomially in $i$ and $j$. □

Last, but not least, we will insert Poly-$\Phi$-CONS into the well-known hierarchy
of learning classes.

Theorem 12. FIN $\subset$ Poly-$\Phi$-CONS $\subset$ CONS

5 Conclusions 

As we have seen, a natural formal definition that captures the notion of polynomially
efficient inference is not at all straightforward. We proved that consistency
is a natural condition having a narrowing effect for polynomial update boundaries.
We also note that conform strategies (hypotheses must be
consistent or may be undefined on the graph seen so far) as well as some definitions
of monotone strategies have the same effect. It is an open problem to find
a general condition that satisfies this narrowing effect for polynomial efficiency.

The update inference complexity is based on the general approach of Blum's
computational complexities and thus covers a huge set of different specific complexities.
Theorems [?] and [?] extend and strengthen well-known results about
the update time and update space of consistent strategies. Corollary [?] encourages
programmers not only to look for consistent learning strategies, but also to
involve inconsistent ones, since this can save an arbitrary amount of resources. Moreover,
Theorem [?] is an exact characterization of polynomial consistent learnability, giving
a deep view inside the problem of resource-bounded consistent learnability.

We were not able to carry over Corollary 2 to arbitrary Blum complexity measures,
and it is still open whether there are arbitrarily $\Phi$-hard consistently learnable sets.
The last theorem, Theorem [?], joins complexity and non-complexity learning-theoretic
problems.



Consistent Polynomial Identification in the Limit 









W. Stein 






Author Index 



K. Apsītis 46

S. Arikawa 247, 262 

H. Arimura 247 

J. Barzdins 400 

J. Case 31, 205, 276 

P. Damaschke 103 

F. Denis 112 

M. Fischlin 72 

R. Freivalds 46 

R. Fujino 247 

G. Grieser 409 

T. Head 191 

E. Hirowatari 262 

S. Jain 205, 276, 291 

K. P. Jantke 409 

S. Kaufmann 276 

S. Kobayashi 191 

M.K.R. Krishna Rao 143 

S. Lange 409 

A. Maruoka 127 

E. McCreath 336 

G. Melideo 87 

L. Meyer 306 

Y. Mukouchi 220 

M. Ott 31 

M.M. Richter 1 



H. Sakamoto 234

U. Sarkans 400

M. Sato 220

K. Satoh 179

A. Sattar 143

M. Schmitt 375

A. Sharma 31, 276, 336

V.N. Shevchenko 61

R. Simanovskis 46

C.H. Smith 1

J. Smotrovs 46

W. Stein 424

F. Stephan 31, 276, 321

N. Sugimoto 169

E. Takimoto 127

S. Varricchio 87

Y. Ventsov 321

K.A. Verbeurgt 385

A. Wataki 247

R. Wiehagen 1

S. Wrobel 11

A. Yamamoto 158

T. Yokomori 191

T. Zeugmann 1

D. Zheng 220

N.Yu. Zolotykh 61